STATISTICAL ANALYSIS


IN BIOLOGY 



By the Same Author

THE MEASUREMENT OF LINKAGE 

IN HEREDITY 



STATISTICAL ANALYSIS 

IN BIOLOGY 

by 

K. MATHER 

D.Sc., Ph.D. 

Professor of Genetics 
in the University of Birmingham 


With a Foreword by 
R. A. FISHER 

Sc.D., F.R.S.

Arthur Balfour Professor, University of Cambridge 


METHUEN & CO. LTD. LONDON 

36 ESSEX STREET, STRAND, W.C.2 



First published January 14, 1943 
Second Edition, Revised, December 1946 
Third Edition August 1949 
Fourth Edition 1951 
Reprinted 1960 






PRINTED AND BOUND IN GREAT BRITAIN BY
BUTLER AND TANNER LTD., FROME AND LONDON




FOREWORD 


ONE of the most encouraging features about the modern statistical
methods, which have developed in this country during the
last fifteen years, is the readiness with which they are applied in 
aid of practical research. The enterprise and facility with which 
the younger biologists in particular have exploited these methods, 
and the stimulus they have found in them for the advancement 
of their own studies, constitute the real proof of their value. 
In a well-designed experiment every feature of the analysis has 
a meaning relevant to the understanding of the situation which 
the experiment was intended to explore. The earlier statistical 
literature, on the other hand, abounds in mathematical artefacts 
the interpretation of which is as ambiguous as their calculation 
is tortuous and indirect. At the present time it is the elementary 
expositions which suffer most from this academic tradition, for 
these are naturally more timid and imitative than more advanced 
works. 

Dr. Mather has already illustrated the power of these methods 
in genetic analysis, in his work on The Measurement of Linkage 
in Heredity, a book which every geneticist would be wise to have
by him. The present work, designed as a more general introduction
to statistical methods for biological investigators, shows the
same practical grasp of the essentials of good experimentation, 
and the same deliberate avoidance of what is extraneous. It is 
very simply written, and by well-chosen examples exhibits every 
step of the processes needed. The careful reader should rapidly 
acquire a repertoire of techniques appropriate to very varied 
circumstances. 

R. A. FISHER





This impatience was very foolish, and in 
after-years I have deeply regretted that I did 
not proceed far enough at least to understand 
something of the great leading principles of 
mathematics, for men thus endowed seem to 
have an extra sense. 

CHARLES DARWIN 


General impressions are never to be trusted. 


FRANCIS GALTON 



PREFACE 


STATISTICS is the concern of two different groups of scientist. 
The first group, of mathematical statisticians, is interested in 
developing the theory and extending the applicability of their 
subject, while the second group, which consists of non-mathematicians,
is concerned largely with using the methods already
available as tools in their own researches. Among this latter
group biologists are forced by the peculiarities of their experimental
material to occupy a leading position; for it is very
rarely that the full value of a biological experiment can be 
realized before the observations have been subjected to a suitable 
statistical analysis. 

This separation into two groups, which might be termed 
respectively the makers of statistics and the users of statistics, 
is not, of course, complete. The mathematician must be able 
to appreciate the problems met by the users of his product, or 
his work will be sterile. Similarly, the biologist must have 
a sufficient knowledge of statistical theory to know how far 
present-day methods will take him and at what point he must 
turn to the statistician for further help and advice. That the 
full development of the subject is dependent on such co-operation 
is amply shown by the great advances which have resulted from 
the association of R. A. Fisher and his school with the agronomical 
and other biological research carried out at the Rothamsted 
Experimental Station and elsewhere. 

As a non-mathematician I cannot pretend to more than a 
passing acquaintance with those branches of statistics which are 
peculiarly the province of the mathematician. But as a biologist 
I can appreciate both the necessity for a better understanding 
of the potentialities of statistics amongst experimentalists, and 
the difficulties they feel in approaching a science which has a 
theory and language so unlike their own. Much has been done 
by R. A. Fisher’s two books, Statistical Methods for Research 
Workers and The Design of Experiments, to assist biologists in
this direction, and I hope the present work will also be of material
help in a somewhat different way. I have concentrated on
trying to show the scope of the various methods, how they are
interrelated and how they fulfil the conditions necessary for
satisfactory analysis. 




This treatment will, no doubt, be subjected to two types of 
criticism. First of all, the mathematician may object that it is 
unsound, as, while resting on a mathematical basis, it avoids 
proof of the various propositions and distributions used. To 
him I would reply that such proofs are not difficult to find 
elsewhere and that, in any case, the need of the biologist is not 
so much to understand the construction of these distributions as 
to obtain a knowledge of the extent to which they help him and 
of how he should set about using them. His needs are not those 
of the mathematician. The second criticism may be made by 
the biologist who maintains that the book is of too advanced a 
mathematical standard. To defend myself against this attack I 
must point out two things, viz. that simple algebraic manipulation 
and elementary differential calculus are taught as routine to 
many school-children, and that as a matter of experience I have 
found this very treatment helpful in teaching statistics both to 
myself and to others during the last seven years or so. The main 
work involved is that of learning the vocabulary and grammar of 
the symbol language with which ideas are expressed and developed
mathematically. This is, after all, no more difficult than acquiring
a working knowledge of some spoken language. Indeed, to
many of us it is easier, as there are no exceptions to be memorized. 
I have tried to lessen some of the labour by gradually working 
up to the more mathematical parts, which are mainly in the 
later chapters. 

Statistics and microscopy occupy a very similar position in 
biology. Both have been, and are, the subject of study by 
non-biological specialists, yet the use of existing facilities in both 
is necessary for the full development of the biological sciences. 
Their difference lies in the fact that biologists are early introduced 
to the use of the microscope, but are, at present, forced to pick 
up a knowledge of statistics for themselves when they have, 
perhaps somewhat reluctantly, found it to be necessary. A
modification of biological teaching to include the elements of 
statistical analysis would have a profound and very beneficial 
effect. 

Many sample analyses are given as examples in the text. 
The questions at issue are often somewhat trivial; they have 
been chosen solely with the object of illustrating the argument. 
The data treated have been taken from many fields of research 
but inevitably will show a preponderance from those branches, 
especially my own subject of genetics, with which I am familiar. 





Whatever the nature of the data, the examples should be worked 
through carefully by anyone wishing to derive full benefit from 
the book, as it is the treatment, not the subject matter, which 
is of importance. 

I have to thank Dr. C. H. Cadman and Mr. L. G. Wigan for
reading and criticizing the text. Miss B. Schafer has been of great 
assistance in preparing the manuscript and the late Mr. H. C. 
Osterstock is responsible for drawing most of the diagrams. I 
am especially indebted to Professor R. A. Fisher for his advice, 
not only in the preparation of this book, but also on the many 
occasions when he has helped me over troublesome difficulties. 

Tables I-IV are abridged from Statistical Tables for Biological,
Agricultural and Medical Research, by R. A. Fisher and F. Yates,
with the kind permission of the authors and of the publishers,
Messrs. Oliver and Boyd. When fuller versions of these tables 
are necessary, reference should be made to the originals. 

October 1942 K. M. 


PREFACE TO THE SECOND EDITION 

DURING the period which has elapsed since the appearance of 
the first edition, my attention has been drawn to a number of 
misprints, ambiguities and inaccuracies which it contained. 
I wish to thank all those correspondents, especially Mr. K. 
Williams, who have helped me in this way. I trust that, as 
a consequence, this edition will be less open to misunderstanding. 

A number of suggestions have also been made to me in regard 
to problems and techniques which received no mention in the 
first edition, and yet which might increase the usefulness of the 
book and render it of value to a wider public. While sympathizing
with many of them, I have found it impossible to include them
all if an unjustifiable expansion of the book was to be avoided. 
A thirteenth chapter has, however, been added. It includes 
accounts of the angular and probit transformations. These were 
chosen not merely as illustrating the widely used and special 
techniques of toxicology and biological assay, but also to show 
the general value of transformations in rendering data more 
manageable. 

July 1945 


K. M. 



CONTENTS 


CHAPTER

I INTRODUCTORY

1. THE NATURE OF STATISTICS. 2. POPULATIONS AND SAMPLES.
3. DIAGRAMS AND GRAPHS

II PROBABILITY AND SIGNIFICANCE

4. SIMPLE PROBABILITY. 5. COMPOUND PROBABILITY. 6. AGREEMENT
WITH HYPOTHESIS. 7. SIGNIFICANCE

III DISTRIBUTIONS

8. THE NORMAL DISTRIBUTION. 9. THE MEAN AND STANDARD
DEVIATION. 10. FITTING THE NORMAL CURVE. 11. SKEWNESS
AND KURTOSIS. 12. THE POISSON SERIES. 13. THE MEAN AND
VARIANCE OF THE BINOMIAL DISTRIBUTION

IV TESTS OF SIGNIFICANCE

14. THE NORMAL DEVIATE. 15. THE t DISTRIBUTION. 16. THE
z DISTRIBUTION. 17. THE χ² DISTRIBUTION. 18. THE INTERRELATIONS
OF c, t, χ² AND z, AND THEIR USE IN ANALYSIS

V THE SIGNIFICANCE OF SINGLE OBSERVATIONS, SUMS,
DIFFERENCES AND MEANS

19. SINGLE OBSERVATIONS. 20. THE VARIANCE OF SUMS,
DIFFERENCES AND MEANS. 21. THE SIGNIFICANCE OF MEANS
AND DIFFERENCES OF MEANS

VI DEGREES OF FREEDOM AND THE ANALYSIS OF VARIANCE

22. THE INDIVIDUALITY OF DEGREES OF FREEDOM. 23. THE
PRINCIPLES OF PARTITION. 24. THE ANALYSIS OF VARIANCE.
25. INTERACTIONS BETWEEN MAIN EFFECTS. 26. INCOMPLETE
ANALYSIS

VII PLANNING EXPERIMENTS

27. THE FACTORIAL EXPERIMENT. 28. AN EXPERIMENT WITH
THREE FACTORS. 29. THE CONTROL OF ERROR. 30. CONFOUNDING

VIII THE INTERRELATIONS OF TWO VARIABLES

31. LINEAR REGRESSION. 32. THE SAMPLING ERROR OF REGRESSION
CONSTANTS. 33. THE DIFFERENCE BETWEEN TWO
REGRESSION COEFFICIENTS. 34. THE USE OF CONCOMITANT
OBSERVATIONS

IX POLYNOMIAL AND MULTIPLE REGRESSIONS

35. TESTING LINEARITY OF REGRESSIONS. 36. THE CHOICE OF
ORDER OF A POLYNOMIAL. 37. THE CALCULATION OF A POLYNOMIAL
REGRESSION. 38. REGRESSION ON TWO OR MORE
VARIATES. 39. DISCRIMINANT FUNCTIONS

X CORRELATION

40. INTER-CLASS CORRELATION. 41. THE COMBINATION OF
INTER-CLASS CORRELATION COEFFICIENTS. 42. PARTIAL CORRELATION.
43. INTRA-CLASS CORRELATION

XI THE ANALYSIS OF FREQUENCY DATA

44. χ² AND THE NORMAL DEVIATE. 45. THE VARIOUS FORMS
OF χ². 46. PARTITIONING χ². 47. THE EFFECT OF FITTING A PARAMETER.
48. HETEROGENEITY OF DATA. 49. THE 2×2 CONTINGENCY
TABLE. 50. THE 2×i TABLE. 51. THE GENERAL
CONTINGENCY TABLE

XII ESTIMATION AND INFORMATION

52. PROBABILITY AND LIKELIHOOD. 53. THE METHOD OF MAXIMUM
LIKELIHOOD. 54. INEFFICIENT STATISTICS. 55. SIMULTANEOUS
ESTIMATION. 56. COMBINED ESTIMATION AND HETEROGENEITY
TESTS. 57. PLANNING EXPERIMENTS. 58. FIDUCIAL
PROBABILITY

XIII SOME TRANSFORMATIONS

59. THE ANGULAR TRANSFORMATION. 60. THE PROBIT TRANSFORMATION

GLOSSARY OF TERMS

TABLES

INDEX




CHAPTER I 


INTRODUCTORY 

1. THE NATURE OF STATISTICS 

STATISTICS is the mathematics of experiment. Experiments 
are conducted with the object of answering some question or 
questions in which the experimenter is interested, but it is
seldom that the answer can be seen before the results have
been subjected to some form of analysis. For the results of an
experiment, especially a biological experiment, commonly show
the influence of many factors other than those whose investigation
forms the reason of the research. Some of these disturbances
may be traced to known or partially known causes, but the
majority are unaccountable and constitute sources of potential
error in the interpretation of the results. The objects of
statistical analysis are (i) to reduce the data, which because of
their very bulk and complexity are incapable of being fully comprehended
by the mind of the experimenter, to a few easily
understood quantities containing most, if not all, of the information
relevant to the subject under investigation; and (ii) to assess
the meaning and importance of these quantities while making due
allowance for the errors caused by disturbing influences.

Every experiment is of necessity limited in scope. Its results 
afford a sample of what would be observed if all the material 
of the particular kind under investigation was subjected to the 
special circumstances of the experimental procedure. Yet the
hypothesis which seeks to explain the resulting observations
must account equally well for the outcome of any of the many
other possible experiments of the particular kind under consideration.
Thus if experimental science is not to be dismissed
as meaningless, it must be held possible to draw conclusions of
general validity from the results of necessarily restricted experiments,
i.e. it must be held possible to argue from the particular
to the general. To the extent that statistics is developed to
facilitate such arguments it is the mathematics of inductive
reasoning, in contradistinction to other branches of mathematical
science which are concerned with deducing the logical consequences
of a given set of postulates. This is not to say that
statistical analyses never involve deduction. Not infrequently
they do, as when testing the adequacy of a given hypothesis to
account for or explain particular observations. But hypotheses
are seldom so precisely formulated as to be capable of being
tested before the data have been made to supply, by inductive
reasoning, some quantity necessary to their full specification.
Statistical operations reflect that alternation of induction and 
deduction which is so characteristic of experimental science. 

2. POPULATIONS AND SAMPLES 

The primary concept of statistics is that of the infinitely large 
hypothetical population of which the observed data form 
a sample. It is to the population, which is characterized and 
described by certain quantities, termed parameters, that the 
hypothesis with which the experimental results are being compared,
or to whose specification they are contributing, applies.
The population is the statistical way of representing the hypothesis,
or rather its consequences, and the sample is the way of
representing the observations. Thus suppose we consider the
time-honoured example of tossing a coin. The hypothesis is
that the coin is true or unbiased, i.e. that the chances of its
showing a head or tail after tossing are equal. The population
comprises an infinitely large number of tosses of the coin
and, of course, exactly half the population shows heads and
exactly half shows tails. The population is in fact characterized
by the parameter $\xi = \frac{1}{2}$, where $\xi$ is the proportion of heads, or
tails.

We sample the population by tossing the coin and observe 
either a head or a tail. This observation may be tested for 
agreement with hypothesis, i.e. we may undertake an analysis 
designed to determine whether it could reasonably be considered 
as a sample of the population which the hypothesis generates. 
Clearly in this case it can, as one or other result of tossing must 
be obtained. But we may toss the coin a large number of times, 
and then more possibilities would exist. If $a_1$ heads and $a_2$ tails
are observed, we know that $a_1$ should be roughly equal to $a_2$ if
the observations are really a random sample of this population.
The degree to which $a_1$ and $a_2$ depart from this expected equality
affords a means of reaching a decision as to whether observation 
agrees with hypothesis or not. This operation is what we shall 
come to know as a test of significance. 

The data could, however, be used in another way. Just as 
the population is characterized by the parameter $\xi = \frac{1}{2}$, where $\xi$ is
the theoretical proportion of heads, the sample is characterized
by the statistic $x = \frac{a_1}{a_1 + a_2}$, $x$ being the observed proportion of
heads. Now $\xi$ and $x$ are clearly related since, as $a_1$ and $a_2$
increase, $x$ must approach closer and closer towards $\xi$ in value.
But equally clearly $x$ will not in general equal $\xi$ as its value is
subject to chance errors of sampling from the population. We
must then regard $x$ not as telling us the exact value of $\xi$ but as
affording an estimate of this value. This distinction between
the parameter specifying the population and the statistic which
represents all that a sample can tell us about the parameter is
very important in statistical theory and will be encountered again
on many occasions. The statistic, or estimate of a parameter,
is subject to sampling errors whose magnitude may, however, be
calculated exactly, and so although it is impossible to find $\xi$
from $x$ it is possible to say rigorously, solely from what is known
about $x$, that $\xi$ lies within a given range of values with a given
probability.
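
The way in which the observed proportion of heads settles towards the population value as the number of tosses grows may be illustrated by simulation. The sketch below is an added illustration and not part of the original treatment: it assumes that Python's pseudo-random generator is an acceptable stand-in for an unbiased coin, and the function name `observed_proportion` and the fixed seed are arbitrary choices.

```python
import random


def observed_proportion(n_tosses, seed=1):
    """Toss a simulated unbiased coin n_tosses times; return a1 / (a1 + a2)."""
    rng = random.Random(seed)  # fixed seed (arbitrary) so that a run can be repeated
    a1 = sum(1 for _ in range(n_tosses) if rng.random() < 0.5)  # number of heads
    a2 = n_tosses - a1                                          # number of tails
    return a1 / (a1 + a2)


if __name__ == "__main__":
    for n in (10, 100, 10_000, 100_000):
        x = observed_proportion(n)
        print(f"n = {n:>7}: x = {x:.4f}, departure from 1/2 = {abs(x - 0.5):.4f}")
```

In any one run the departure from one half need not shrink at every step, but it should become small for the larger samples, which is all that the argument requires.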

The question of rigour in such cases is interesting. The 
conclusions drawn concerning $\xi$ are not certain, but they are
none the less rigorous as the nature and degree of the uncertainty 
can be stated exactly. This is equally true of conclusions drawn 
from a test of significance. Even though it may not be possible 
to say with certainty that an observation is or is not a sample 
from the population, it is possible to state exactly the degree of 
probability of either alternative being true. Uncertainty does 
not preclude rigour, provided that the degree of uncertainty is 
itself exactly known. 

Rigorous statements, even though they are uncertain statements,
must be based solely on known facts, since it is only on
this basis that the degree of uncertainty can be stated. In our
example of estimating $\xi$, the statement of its value, or rather of
the limits to its value, must be based only on the data contained
in the observed sample, since these data represent the sum total
of our accurate knowledge about the coin. It would be fatal to
the rigour of the conclusions to attempt the introduction into
the analysis of speculative controlling agents whose existence
was unproven or whose effect was unmeasured. The idea, for
example, that greater weight must be accorded to the possibility
of $\xi = \frac{1}{2}$, as compared with any other single value, because the
mint responsible for the coin is unlikely to send out a biased
product, is useless. It contributes nothing accurate to our data.
If, however, it is possible to give the range and frequency of
the values of $\xi$ shown by the coins emerging from the mint in
question, the additional data are precise and they can be incorporated
in the calculation with profit. In doing this we are
clearly not transgressing the rule that any statement about the
parameter must be based solely on the data of observation, for
the detailed knowledge of the distribution of bias in the mint’s
coins must have originated from a series of precise observations.

The method of incorporating extra information of this 
particular type in the analysis is worth noting. It would be 
used to describe a super-population of coins each of which gives
a population of throws like the one whose sample is discussed
above. Clearly this super-population can only be adequately 
specified if the range and frequency of bias in coins are accurately 
known. In the absence of such knowledge the super-population 
cannot be used, though, as will be seen in Chapter XII, attempts 
have been made to employ it as a means of analysis by giving 
it characteristics based on a priori arguments. Such arguments 
introduce a subjective element into an analysis otherwise solely 
concerned with objective experimental data. In consequence 
they cannot be considered as anything but misleading. 

Thus statistical analysis must aim at making the data tell 
their own story in such a way that their true value and degree of 
trustworthiness may be accurately assessed. Under-assessment
of value is wasteful and over-assessment is misleading. Both
must be avoided. Fisher’s co-ordination of his own analysis of
the problem of estimation with the use of the exact tests of
significance derived by Pearson, ‘Student’ and himself makes
this aim capable of achievement by means that are illustrated
in the succeeding chapters.

3. DIAGRAMS AND GRAPHS 

Before turning to the discussion of the methods by which 
satisfactory statistical analyses are made, a word about diagrams 
and graphs is necessary.

Diagrams and graphs share with statistical analyses the 
purpose of separating the relevant from the irrelevant in the data 
and presenting the results in such a way that this distinction is 
reasonably obvious. The methods of achieving this end are, 
however, very different in the two cases. In an analysis the 
procedure is that of reducing the data to a few numbers which 
have certain definite and known properties and which can hence
be used as the basis of rigorous statements. A diagram is, on
the other hand, merely a geometrical representation of the data,
carried out in such a way that any marked trends are obvious on
inspection. It can never replace statistical analysis because the
interpretation of a diagram is essentially subjective. This is not,
however, to say that diagrams are useless; on the contrary, they
are invaluable as adjuncts to statistical analysis, in two ways.
In the first place, when used as a preliminary to the analysis,
they serve to draw the attention of the experimenter to the
salient features of his data and frequently ensure that he does
not overlook some unexpected relationship. In doing this a diagram
may suggest the form of analysis most appropriate.
In the second place, diagrams are of value when used
subsequently to the analysis in order to display the results.
Other persons interested in the work can then understand the
findings quickly and easily, in a way that non-geometrical representation
seldom permits.

The forms which diagrams and graphs can take are too 
numerous to be detailed here, though certain types will be 
illustrated later. No matter, however, what form a diagram 
assumes, it should always be made as simple as is compatible 
with the purpose in mind. Excessive elaboration defeats its 
own end, because it obscures the important features of the data 
both from the experimenter and from his subsequent readers. 





CHAPTER II

PROBABILITY AND SIGNIFICANCE 

4. SIMPLE PROBABILITY 

ONE of the fundamental concepts of statistics is that of probability.
It is an idea with which most of us are familiar in a general
way, but a precise definition of the term is necessary before it 
can be applied to the analysis of experimental data. 

When we speak of the probability of an event we imply that 
circumstances do not allow us to determine its regular occurrence, 
so that at a given opportunity the event may or may not happen ; 
but over an extensive series of opportunities it will take place 
in a characteristic proportion of cases. This proportion may be 
expressed numerically. Thus we can toss a penny and it may 
come down ‘heads’, though on the other hand it may equally
show ‘tails’. Under the ordinary conditions of tossing it is not
known which of these two possibilities will be realized by any
spin. If, however, the penny is tossed a large number of times,
assuming the absence of bias, we expect that the coin will show
heads and tails in equal proportions of the trials. The probability
of each result is then $\frac{1}{2}$. In general the probability of occurrence
of an event may be defined in the following way: If, out of
a very large number, $n$, of similar occasions on each of which
a given event may occur, that event actually happens in $a$ cases,
the probability of the occurrence of the event is $\frac{a}{n}$. With a
limited number of trials, such as must always content us in
practice, the proportion of successes may not exactly equal
$\frac{a}{n}$, but the larger $n$, and hence $a$, become, the more closely will
the ratio of number of successes to number of trials tend to
approach the true probability.

The probability of an event is sometimes predictable from the 
data given. Thus in the case of the character ‘ fern ’ leaf in 
Primula sinensis our previous experience is compatible with the 
belief that fern differs from the normal palm type by a single 
recessive gene. Fern is always homozygous ff, while palm may
be Ff or FF. If a heterozygous palm is crossed with a fern
(Ff × ff), it is expected according to simple mendelian theory that
half the progeny will have palm and half fern leaves. So the
probability of any individual being fern is predictable from
previous knowledge of gene transmission. Such a forecast is,
however, not always possible. Suppose that seed had been
collected from a fern plant exposed to open pollination by both
fern and palm individuals. In the absence of any knowledge as 
to the frequency of the various kinds of pollination and as to 
the constitution of the palm plants, i.e. whether homozygous or 
heterozygous, it would be impossible to predict the probability 
of any individual in the progeny being fern, because there are 
three possible types of cross, each giving different expectations. 
In such cases it is necessary to resort to estimation from experimental
data in order to determine the probability. But the
probability of any plant being fern is always a characteristic 
property of the particular sample of seed and may be found by 
suitable methods. 
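
The three possible types of cross, and the different expectations they give, can be set out by enumerating the gene combinations. The sketch below is an added illustration in Python, not part of the original text; it assumes simple mendelian transmission, with each parent passing on either of its two genes with equal chance, and the function name `fern_probability` is hypothetical.

```python
from fractions import Fraction
from itertools import product


def fern_probability(seed_parent, pollen_parent):
    """Probability that an offspring is fern (ff) from the given cross.

    Each parent is written as a two-letter genotype, e.g. "Ff"; every
    pairing of one gene from each parent is taken to be equally likely.
    """
    offspring = [a + b for a, b in product(seed_parent, pollen_parent)]
    ferns = sum(1 for genotype in offspring if genotype == "ff")
    return Fraction(ferns, len(offspring))


if __name__ == "__main__":
    # A fern seed parent (ff) pollinated by fern, heterozygous palm or homozygous palm.
    for pollen in ("ff", "Ff", "FF"):
        print(f"ff x {pollen}: probability of fern = {fern_probability('ff', pollen)}")
```

The three crosses give expectations of 1, 1/2 and 0 respectively, so that without knowing how often each kind of pollination occurred the probability for the open-pollinated seed cannot be predicted.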


5. COMPOUND PROBABILITY 

Just as the probability of one individual being of a given 
type may be determined, it is also possible to ascertain that of 
finding pairs or higher groups containing given numbers of
individuals of each type. Consider the case of a family of five
plants raised by back-crossing a heterozygous palm by fern
(Ff × ff). The probability of each plant being palm is $\frac{1}{2}$ and of
its being fern is also $\frac{1}{2}$. What is the probability that the family
will consist of four palms and one fern?

Let us label the five plants, A, B, C, D and E. Two equally 
likely types of family may be distinguished according to whether 
A is palm or fern. Similarly the leaf form of B may be used to 
separate two types of family. Then since B’s leaves cannot be 
considered as determined by those of A, 2x2 types of family, all 
equally likely, can be recognized by the simultaneous leaf forms 
of these two plants. On extending this argument it is seen that 
there are $2 \times 2 \times 2 \times 2 \times 2$ or $2^5$ kinds of family all of equal probability,
distinguishable when the leaf forms of all five plants are taken
into account. The probability of any one of these families is
clearly $\frac{1}{32}$ or $2^{-5}$. So the problem reduces to that of finding
what proportion of these $2^5$ families consists of 4 palms and
1 fern. The fern plant may be A, B, C, D or E, but once it has
been decided which of the five is of this kind, the family is completely
specified, since all the rest must be palm. Hence there
are only five different families which will fulfil the stated condition,
viz. $A^f B^F C^F D^F E^F$, $A^F B^f C^F D^F E^F$, $A^F B^F C^f D^F E^F$,
$A^F B^F C^F D^f E^F$ and $A^F B^F C^F D^F E^f$, where F is palm and f is
fern. So the required probability is 5 out of 32, i.e. $\frac{5}{32}$ or
0.15625.
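
The counting argument can be checked by listing every family outright. The following sketch is an added illustration in Python, not part of the original text; the variable names are arbitrary, and it assumes, as in the back-cross, that palm and fern are equally likely for each plant so that all families carry equal weight.

```python
from itertools import product

plants = "ABCDE"

# Every possible family: one leaf form for each of the five labelled plants.
families = list(product(("palm", "fern"), repeat=len(plants)))

# Families answering to the requirement of four palms and one fern.
wanted = [family for family in families if family.count("fern") == 1]

probability = len(wanted) / len(families)
print(len(families), len(wanted), probability)  # 32 5 0.15625
```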

In the above example the chance of finding a family of four
ferns and one palm would also be 0.15625, as five of the 32 equally
likely families would fulfil this requirement. These five would
be specified by A, B, C, D or E being palm, the others being fern.
If, however, the probabilities of any individual being palm and
fern were not equal, the situation would be rather different,
because the 32 types of family would not be expected equally
often. Where, for example, the plants were from $F_2$ seed obtained
by selfing a heterozygous palm (Ff × Ff), palm and fern would be
expected with probabilities of $\frac{3}{4}$ and $\frac{1}{4}$ respectively. A would
then be palm in $\frac{3}{4}$ and fern in $\frac{1}{4}$ of the cases, and similarly for B.
The probability of both A and B being palm would be $\frac{3}{4} \times \frac{3}{4}$, of
A being palm and B fern $\frac{3}{4} \times \frac{1}{4}$, of A fern and B palm $\frac{1}{4} \times \frac{3}{4}$ and of
both being fern $\frac{1}{4} \times \frac{1}{4}$. By extending this argument it will be seen
that the probability of obtaining any one of the different families
which contain four palms and one fern would be $(\frac{3}{4})^4(\frac{1}{4})$. There
would again be five equally likely families answering to the
requirement of four palms and one fern, so the total would be
$5(\frac{3}{4})^4(\frac{1}{4})$ or 0.39551. It will also be seen that the chance of
getting a family with four ferns and one palm is $5(\frac{3}{4})(\frac{1}{4})^4$, and so
differs from that of the reciprocal type of family.
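
When the two leaf forms are not equally likely, each family must be weighted by its own probability before the suitable ones are added together. The sketch below is an added illustration in Python, not part of the original text; it assumes probabilities of 3/4 for palm and 1/4 for fern, as for the selfed heterozygote, and the names `P_PALM`, `P_FERN` and `family_probability` are hypothetical.

```python
from fractions import Fraction
from itertools import product

P_PALM = Fraction(3, 4)  # probability that any one plant is palm (written F)
P_FERN = Fraction(1, 4)  # probability that any one plant is fern (written f)


def family_probability(n_palm, n_fern):
    """Total probability of all families with n_palm palms and n_fern ferns."""
    total = Fraction(0)
    for family in product("Ff", repeat=n_palm + n_fern):
        if family.count("F") != n_palm:
            continue  # this family does not meet the requirement
        weight = Fraction(1)
        for leaf in family:  # multiply the separate probabilities, plant by plant
            weight *= P_PALM if leaf == "F" else P_FERN
        total += weight
    return total


if __name__ == "__main__":
    print(float(family_probability(4, 1)))  # four palms and one fern
    print(float(family_probability(1, 4)))  # four ferns and one palm
```

Exact fractions are used so that the two totals, 405/1024 and 15/1024, can be compared without rounding.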

We may recognize two steps in the calculation of a compound 
probability. The first is the determination of the number of 
types of family that fulfil the given requirements. The second 
is that of finding the probabilities of each of these types. In 
the above examples there were five types of family, distinguishable
only by the order in which the leaf forms occurred and so
all equally suitable. In the $F_2$ plants these five were all equally
probable with an expectation of $(\frac{3}{4})^4(\frac{1}{4})$. The final probability
was the product of these two figures.

The further application of these rules may be illustrated by
a somewhat more complicated example. What is the probability
of a family of two palms and three ferns in the case of $F_2$ plants?

To find the number of suitable families we first note that the 
family is specified when it is known which two of the five plants 
are palm. Let these two palms be distinguished as first and 
second. The first may be any one of five plants, but when this 
is fixed the possibilities for the second are more limited. It 
must be one of the remaining four plants. So the first palm 
may be assigned in five ways and the second in four. Furthermore,
there is no reason to believe that the first one determines
the second, other than in limiting the number of plants which
may fill this role. Hence the total number of ways of jointly
assigning the two is 5×4, or 20. This is called the number of
permutations of five taken two at a time and is written ${}_5P_2$.
It may also be expressed in factorial notation as $\frac{5!}{(5-2)!}$, where
$5! = 5\times4\times3\times2\times1$ and $(5-2)! = 3! = 3\times2\times1$. But of these 20 ways some
are identical, for we have formally distinguished the palms as
first and second and so drawn a distinction between the cases,
say, when A is the first palm and B the second and when B is
the first palm and A the second. This distinction arises solely
from the method of approach and has no real meaning. A little
consideration will show that the 20 ways consist of 10 such pairs.
So the number of types of family with two palms and three ferns
is $\frac{5\times4}{2} = 10$. This is the number of combinations of five taken two
at a time and is written as ${}_5C_2$. In the factorial notation it is
$\frac{5!}{(5-2)!\,2!}$. It should be noticed that ${}_5C_2$ is the same as ${}_5C_3$, since

$$\frac{5!}{(5-2)!\,2!} = \frac{5!}{3!\,2!} = \frac{5!}{2!\,3!} = \frac{5!}{(5-3)!\,3!}$$

The essential difference between permutations and combinations
is that the former takes into account order, which the latter
neglects.
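
The counting argument above can be checked mechanically. The following is only a sketch, assuming Python and its standard `math` module, neither of which belongs to the original text; the variable names are illustrative.

```python
import math

# Ordered choices of two palms out of five plants: 5!/(5-2)!
n_permutations = math.perm(5, 2)

# Unordered choices, i.e. types of family: 5!/((5-2)! 2!)
n_combinations = math.comb(5, 2)

# Each unordered pair (A, B) is counted twice among the ordered pairs.
orderings_per_pair = n_permutations // n_combinations

print(n_permutations, n_combinations, orderings_per_pair)

# Choosing the two palms is the same as choosing the three ferns.
print(math.comb(5, 2) == math.comb(5, 3))
```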

Having decided that there are 10 types of family, it is next 
necessary to find the probability of each. The probability that 
an individual will be palm is $\tfrac{3}{4}$, and so the chance of two being
palm simultaneously is $(\tfrac{3}{4})^2$. Similarly the chance of finding
three ferns is $(\tfrac{1}{4})^3$ and so a family of two palms and three ferns
has a probability of $(\tfrac{3}{4})^2(\tfrac{1}{4})^3$. Multiplying this by the number of
suitable types of family, which are clearly all equally likely, we
get as the probability of a family of two palms and three ferns
$10(\tfrac{3}{4})^2(\tfrac{1}{4})^3$, or $10\times\tfrac{9}{1024}$, i.e. 0.08789. There are also 10 families
containing 3 palms and 2 ferns, as ${}_5C_3 = {}_5C_2$. But these each have
a probability of $(\tfrac{3}{4})^3(\tfrac{1}{4})^2$ and so the total probability of a family
of this type is $10(\tfrac{3}{4})^3(\tfrac{1}{4})^2$.

The expectations of obtaining five plants of which four, three,
two and one are palm have now been found. Only two further
possibilities remain, viz. families with five palm or five fern plants.
Following the same methods of calculation it can easily be shown
that these two types of family are expected in $(\tfrac{3}{4})^5$ and $(\tfrac{1}{4})^5$ of
cases respectively. So the probabilities of the six types of
family are : 


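Family    Probability                                              Decimal

5F        $(\tfrac{3}{4})^5$                                        0.23730
4F 1f     $\tfrac{5!}{4!\,1!}(\tfrac{3}{4})^4(\tfrac{1}{4})$        0.39551
3F 2f     $\tfrac{5!}{3!\,2!}(\tfrac{3}{4})^3(\tfrac{1}{4})^2$      0.26367
2F 3f     $\tfrac{5!}{2!\,3!}(\tfrac{3}{4})^2(\tfrac{1}{4})^3$      0.08789
1F 4f     $\tfrac{5!}{1!\,4!}(\tfrac{3}{4})(\tfrac{1}{4})^4$        0.01465
5f        $(\tfrac{1}{4})^5$                                        0.00098
Total     1.0                                                       1.0

The second column of the table gives the probabilities in factorial
notation, and the third column in decimals.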

These results can be arrived at by the simple expansion of the
expression $(\tfrac{3}{4}+\tfrac{1}{4})^5$. Such a binomial series can be used to find
the expectation of the various possible types of family for any
case involving two alternative types. Where p is the probability
that an individual will be of one type and q (= 1-p) that it will
be of the other type, there being k individuals in the family, the
appropriate binomial is $(p+q)^k$. The chance of getting a family
containing r individuals of the first type, the remaining k-r
being of the second, is then

$$\frac{k!}{(k-r)!\,r!}\,p^r q^{k-r}$$
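
The general term of the binomial may be evaluated directly. What follows is a minimal sketch in Python (an assumption of this illustration, not a tool used in the text); the function name `binomial_probability` is hypothetical.

```python
from math import comb

def binomial_probability(r, k, p):
    """Chance of r individuals of the first type in a family of k,
    where p is the chance that any one individual is of that type."""
    q = 1.0 - p
    # comb(k, r) is k!/((k-r)! r!), the number of equally likely orders.
    return comb(k, r) * p**r * q**(k - r)

# F2 families of five plants, palm having probability 3/4 (from the text).
palm_counts = [5, 4, 3, 2, 1, 0]
f2_family = [binomial_probability(r, 5, 0.75) for r in palm_counts]

for palms, probability in zip(palm_counts, f2_family):
    print(f"{palms} palm, {5 - palms} fern: {probability:.5f}")
print("total:", sum(f2_family))
```

The six values should reproduce the decimals tabulated above and sum to unity.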


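As an example of the use of the binomial expansion consider
the case of families of six obtained by backcrossing heterozygous
palm to fern (Ff×ff). The two types, palm and fern, are expected
with equal frequency in the next generation. So $p=q=\tfrac{1}{2}$ and k=6.
The series of expectations will be given by the expansion of
$(\tfrac{1}{2}+\tfrac{1}{2})^6$. These are:

Family    Probability                             Decimal

6F        $(\tfrac{1}{2})^6$                       0.015625
5F 1f     $\tfrac{6!}{5!\,1!}(\tfrac{1}{2})^6$     0.093750
4F 2f     $\tfrac{6!}{4!\,2!}(\tfrac{1}{2})^6$     0.234375
3F 3f     $\tfrac{6!}{3!\,3!}(\tfrac{1}{2})^6$     0.312500
2F 4f     $\tfrac{6!}{2!\,4!}(\tfrac{1}{2})^6$     0.234375
1F 5f     $\tfrac{6!}{1!\,5!}(\tfrac{1}{2})^6$     0.093750
6f        $(\tfrac{1}{2})^6$                       0.015625

where the third column of figures is the evaluation of the second column.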

Where an individual may fall into one of more than two
classes the expression whose expansion gives the series of probabilities
is known as a multinomial. The general form is

$$(p_1+p_2+p_3+\ldots+p_j)^k$$

The expectation of a family containing $r_1$ of the first kind, $r_2$ of
the second, . . . $r_j$ of the jth kind is given by

$$\frac{k!}{r_1!\,r_2!\,r_3!\ldots r_j!}\,(p_1)^{r_1}(p_2)^{r_2}(p_3)^{r_3}\ldots(p_j)^{r_j}$$
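
The multinomial term can be sketched in the same way. The example below assumes Python; the function name and the 1 : 2 : 1 family of four are hypothetical illustrations and do not come from the text.

```python
from math import factorial, prod

def multinomial_probability(counts, probabilities):
    """Chance of a family holding counts[i] individuals of kind i,
    where probabilities[i] is the chance of kind i for one individual."""
    k = sum(counts)
    coefficient = factorial(k)
    for r in counts:
        coefficient //= factorial(r)  # exact at every step
    return coefficient * prod(p**r for p, r in zip(probabilities, counts))

# Hypothetical case: three classes expected in a 1 : 2 : 1 ratio,
# and a family of four containing one, two and one of them.
example = multinomial_probability([1, 2, 1], [0.25, 0.5, 0.25])
print(example)

# With only two classes the expression reduces to the binomial term.
two_class = multinomial_probability([2, 3], [0.75, 0.25])
print(round(two_class, 5))
```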



6. AGREEMENT WITH HYPOTHESIS 

Given the probabilities of obtaining various kinds of family 
on any hypothesis, it is possible to determine how well or ill any 
observed family accords with that hypothesis. The series of 
calculated probabilities tells us the frequency with which to 
expect a result like the one observed. It is then necessary to
decide whether this frequency is sufficiently high to justify the 
belief that the data accord reasonably well with the primary 
hypothesis, or whether such a result would be expected so rarely 
that its occurrence must be taken to indicate that the hypothesis 
is invalid and incapable of explaining the observed data. This 
is the principle of all tests of significance. 

Example 1. Two plants of Primula sinensis, one pin and
the other thrum, were crossed together and a family of eleven
plants grown from the seed so obtained. Of these eleven two
were thrum and nine were pin. Does this agree with the
hypothesis that pin differs from thrum by a single gene?

Now if the difference is due to a single gene the cross would
be of the type Ss×ss and each plant in the progeny would have
a half chance of being thrum and a half of being pin. Then
with a family of eleven plants the various types of segregation
would be expected with frequencies given by the expansion of
$(\tfrac{1}{2}+\tfrac{1}{2})^{11}$. These are set out in Table 1. The chance of getting
nine pins and two thrums is thus $\tfrac{55}{2048}$, or about 1 in 37. This
type of family appears to be but rarely expected on the basis of
a single gene difference and the hypothesis might be considered
to be suspect.

There is, however, one important drawback to the use of the 
expectation of the observed type of family for a direct test of 
this kind, viz. that as the number of plants increases, the chance
of getting any type of family, even though it be a very good fit 
with expectation, decreases rapidly. In a backcross, for example, 
we expect half the plants to be of each kind and so families with 
an exact 1 : 1 ratio are showing a perfect fit. Yet the chance 
of getting such a family diminishes as the size of the family grows 
larger (see Table 2). 

Thus the isolated probability of an observed type can be very 
misleading and cannot be employed in a test of significance. 
We use instead the probability of getting as bad or worse a fit 
on the hypothesis in question. In the case of the perfect 1 : 1 
ratios of Table 2 this is clearly constant at 1, no matter what 
the family size may be. 
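
The decline in the chance of an exact 1 : 1 ratio is easily verified. The sketch below assumes Python; the function name is hypothetical and the family sizes are chosen merely for illustration.

```python
from math import comb

def perfect_ratio_probability(family_size):
    """Chance that a backcross family of even size splits exactly 1 : 1."""
    if family_size % 2:
        raise ValueError("an exact 1 : 1 ratio needs an even family size")
    half = family_size // 2
    # Number of orders giving equal halves, over all 2**n equally likely orders.
    return comb(family_size, half) / 2**family_size

for size in (2, 4, 6, 10, 40, 80):
    print(size, round(perfect_ratio_probability(size), 4))
```

The chance of the best possible fit falls steadily as the family grows, which is why the isolated probability cannot serve as a test.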

In the segregation for pin and thrum any family containing 
nine, ten or eleven pins agrees with the hypothesis as badly as 
or worse than the progeny in question. But we have no reason 
to expect that the deviation from the perfect 1 : 1 shall occur in 
either the direction of more thrums or of more pins. So families 
with nine, ten or eleven thrums also agree as badly or worse with 
hypothesis than our observed progeny. Hence it is necessary to 
sum the probabilities of all six kinds of families containing nine 
or more plants of either kind. These are shown in heavy type
in Table 1 and summation gives $\frac{1+11+55+55+11+1}{2048}$ as the
required probability. This is 0.065, or about 1 in 15. When
considered in this way agreement with hypothesis is not nearly
so bad as it first appeared, and indeed would not generally be
considered to be unduly poor. It is true that the probability
is rather low, but if we were to take such data as contradicting
hypothesis we should be deluding ourselves once in every fifteen
trials. This is rather too often for most purposes, and so such
a result would not be described as differing significantly from
expectation.

TABLE 2

Probability of getting a perfect 1 : 1 ratio: 0.500, 0.3750, 0.3125, 0.2709, 0.2461, 0.1254, 0.0889
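
The summation for the pin and thrum family in Example 1 may be sketched as follows, assuming Python and a 1 : 1 hypothesis; the function name `as_bad_or_worse` is hypothetical.

```python
from math import comb

def as_bad_or_worse(observed, family_size):
    """Count the equally likely orders giving a departure from 1 : 1
    at least as large as that observed, in either direction."""
    centre = family_size / 2
    deviation = abs(observed - centre)
    favourable = sum(
        comb(family_size, r)
        for r in range(family_size + 1)
        if abs(r - centre) >= deviation
    )
    return favourable, 2**family_size

# Nine pins in a family of eleven.
favourable, total = as_bad_or_worse(9, 11)
print(favourable, total, round(favourable / total, 3))

# The isolated probability of exactly nine pins, for comparison.
print(comb(11, 9), "in", total)
```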


7. SIGNIFICANCE 

This question of the significance of results is one which 
frequently causes confusion. The probability of obtaining a fit 
as bad or worse may be calculated exactly; but a subjective 
decision is always involved when the meaning of the probability 
is considered. The level of probability which is considered to 
indicate a significant departure is really the level of admissible 
error, since the rejection of an hypothesis when it shows the data 
to have a probability of one in n means that it will be wrongly 
rejected once in n times. It seems to be generally agreed that 
a probability of 0.05, i.e. 1 in 20, indicates a suspiciously large
departure from expectation, while 0.01 or 1 in 100 should be
taken as showing a real discrepancy between the data and 
expectation. These are, however, not rules and the decision 
must always be dependent to some extent on the circumstances 
of the case. Where much hangs on the outcome of the test it 
might be desirable to adopt a more exacting standard, while if 
the decision has only trivial consequences a higher probability 
might be taken as significant. One rule can, however, be laid 
down. In presenting the results of any test of significance the 
probability itself should be given. The reader is then in a 
position to form his own opinion as to the justification of the 
acceptance or rejection of the hypothesis in question. 

In judging the significance of results all relevant information 
must be taken into account. Where an isolated set of data has 
a probability as low as 1 in 100 the departure from expectation 
would be considered to be real. But if 100 such sets of data 
are analysed and one of them shows a probability of 0.01 it
clearly should not be taken as indicating a departure from
hypothesis, because a fit as poor as this is expected once in
100 sets of data. Table 3 gives the results of analysing 100 backcrosses
for the gene determining yellow body colour in Drosophila
melanogaster (Mather’s data). The probability of obtaining as 
bad or worse a fit has been calculated for each family separately 
and the families then classified according to these probabilities. 

Now the grand totals of yellow and not-yellow flies were
5,273 and 5,329 respectively. These are in good agreement with
the expected 1 : 1 ratio. But one family has a probability of
less than 0.02. This would, if occurring individually, indicate a
significant, or at least a suspiciously large, departure from
expectation. In the present case we attach no significance to
it, as not one but two families showing such poor agreement are
expected. In other words, the evidence for agreement with
hypothesis is actually rather better than random sampling would
lead us to expect. It will be observed that, in general, the
agreement between the number of families expected and the
number observed to have probabilities between given levels is
quite good in these data.

TABLE 3

Distribution of Probabilities of finding a Fit with Hypothesis as bad as or worse than that observed in 100 Backcrosses
for Yellow Body Colour in Drosophila melanogaster

The table might also be read in the reverse direction. For
example, 18 families have a probability of 0.8 or higher. On
the basis of random sampling we should expect 20 out of the
100 to fall in this class. Agreement of observation and expectation
is again good. If too many families, say 60 out of 100, had
shown a probability as high as or higher than 0.8, the hypothesis
would have been just as suspect as if too many families had
shown a very low probability. A ratio of 1 : 1 is indeed expected,
but the theory of random sampling leads us also to expect
a certain measure of departure from this perfect ratio in finite
samples, such as we must always use in practice. If this departure
is not realized, something is wrong. Excessively good agreement
might be due to selection of data, a very dubious and misleading
practice, or to interdependence of the genotypes of the individuals
in a family, in a way that would lead to reduced variability.
Agreement with hypothesis demands that no exceptionally wide
departures from expectation should be observed, but it also
demands that a certain range of departure should be found
according to the sampling technique used. Excessively good
agreement is as much a disproof of hypothesis as is excessively
bad agreement.
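
The expected numbers quoted for the 100 backcrosses follow from a simple proportion. The sketch below is in Python and rests on an assumption stated here rather than in the text: that, when the hypothesis is true, the probability of as bad or worse a fit is spread evenly between 0 and 1 (only approximately so for counts). The function name is hypothetical.

```python
def expected_in_class(n_tests, lower, upper):
    """Number of tests expected to give a probability between lower and
    upper, assuming probabilities spread evenly between 0 and 1."""
    return n_tests * (upper - lower)

# Families expected with a probability below 0.02 among 100 backcrosses.
poor_fits = expected_in_class(100, 0.0, 0.02)

# Families expected with a probability of 0.8 or higher.
close_fits = expected_in_class(100, 0.8, 1.0)

print(round(poor_fits, 2), round(close_fits, 2))
```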

There is a second way in which apparently significant departure
from expectation may be spuriously obtained, viz. by the
application of a number of different tests of significance to the
same data. If 100 such tests are applied, one is expected to
show a probability of as low as 0.01, and such a result must not
be taken to indicate that the data really depart from expectation.
This may be illustrated by the following case. One hundred
numbers, ranging from 0 to 9 inclusive, were taken from Fisher
and Yates's table of random numbers. Now if these numbers
are really random, any pair of them should have a probability
of 0.1 of being identical. Hence if each number is compared
with its successor, the first being taken as the successor of the
hundredth, one hundred comparisons will be made and ten
identities are expected. Actually in the set of numbers used
8 such identities were found. Similarly each number can be
compared with the next but one, the next but two, and so on.
In every case 10 identities are expected out of the hundred
comparisons made. Twenty such tests were applied, and the
probability of each result was calculated. The twenty probabilities
are grouped, as in the case of the Drosophila data, and given
in this form in Table 4. One of the twenty tests of randomness
gave a result with the probability as low as 2%. This cannot,
however, be judged significant, as we expect to get a single test
showing a fit as bad as this in two out of every five trials, each
of which includes twenty such tests. A perusal of the distribution
of probabilities shows that they accord reasonably well with that
expected on the basis of random sampling. There is a tendency
to bunch in the 0.8-0.7 region, but this may be due in part to
the use of only 100 numbers. There is no reason to doubt that
these numbers were indeed random.
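
The twenty comparisons described above can be imitated as follows. This is a sketch only, in Python: the digits are drawn from Python's own generator with an arbitrary seed and are not those of Fisher and Yates's table, and the function name is hypothetical.

```python
import random

def count_identities(digits, lag):
    """Compare each number with the one `lag` places on, the series
    being treated as a ring so that the end wraps round to the start."""
    n = len(digits)
    return sum(digits[i] == digits[(i + lag) % n] for i in range(n))

random.seed(1)  # arbitrary seed, for repeatability only
digits = [random.randrange(10) for _ in range(100)]

# Successor, next but one, next but two, ... : twenty tests in all.
identities = [count_identities(digits, lag) for lag in range(1, 21)]
print(identities)  # about 10 identities are expected in each test

# Number of tests, out of twenty, expected to show a probability of 2% or less.
expected_low = 20 * 0.02
print(round(expected_low, 2))  # i.e. two in every five such sets of tests
```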

So unless all the available information is taken into account, 
the result of a test of significance may be misleading. Several 
bodies of data can very often be combined into one test of 
significance, the decision as to the significance of departure then 
being more easily taken. Methods of doing this will be illustrated 
later. 

Finally, it must always be borne in mind that a hypothesis 
can never be completely disproved, still less proved, by the 
calculation of such a probability. It can only be rendered more 
or less likely, since however small the probability may be, such 
a result as the one observed is expected if sufficient trials are 
made, in spite of our surprise that it should occur in the very 
trial we ourselves conduct. 

REFERENCES 

FISHER, R. A., and YATES, F. 1943. Statistical Tables for Biological, 
Agricultural and Medical Research. Oliver and Boyd. Edinburgh. 
2nd ed. 



CHAPTER III

DISTRIBUTIONS 

8. THE NORMAL DISTRIBUTION 

THE binomial distribution $(p+q)^k$, where $p+q=1$, can only be used
as a basis for the analysis of data where the quantities p and
k are known. In the genetical examples treated in the foregoing 
chapter these parameters are fixed by mendelian theory and the 
magnitude of the experiment respectively. 

Many kinds of data are, however, incapable of being treated 
in this way because they provide no basis for the fitting of 
numerical values to p and k. Thus the height of a maize plant 
is variable just as is the number of fern plants in a family of 
Primulas, but it is impossible to say how many factors are 
influencing the height. In other words, we do not know the 
value of k. Furthermore, we do not know how often or 
how far each factor affects the height of the plant, i.e. we 
do not know p. So it is impossible to find the curve of 
variation of such heights from a consideration of the binomial 
series. 

There is, however, a way of avoiding this difficulty. The 
number of factors influencing the height of a maize plant cannot 
be determined, but it is certainly large, for in addition to the 
many genes there will be innumerable environmental conditions, 
climatic, edaphic and biological, all playing their part in 
determining this one character. Now, provided that neither p
nor q is of the magnitude of $\tfrac{1}{k}$ or less, as k becomes larger and
larger the frequency distribution obtained by expanding $(p+q)^k$
approaches more and more closely to a curve which has the
formula

$$m = \frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

where m is the frequency of the class x. This limit to the binomial
is called the Normal Curve of Errors, or for short, the Normal
Curve. It will be observed that its formula contains neither
p nor k, but is characterized by two new parameters, μ and σ.
These two quantities are seldom fixed by hypothesis, but, unlike
p and k, can easily be estimated from experimental data such as
those concerning the height of maize plants.
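
The approach of the binomial to the normal curve can be seen numerically. The sketch below assumes Python and uses the standard relations μ = kp and σ = √(kpq) for the limiting curve, which are not stated in the text; k = 100 and p = ½ are arbitrary illustrative values.

```python
from math import comb, exp, pi, sqrt

def normal_frequency(x, mu, sigma):
    """Ordinate m of the normal curve at x."""
    return exp(-((x - mu) ** 2) / (2 * sigma**2)) / (sigma * sqrt(2 * pi))

def binomial_frequency(r, k, p):
    """Term of the expansion of (p+q)^k for r individuals of the first type."""
    return comb(k, r) * p**r * (1 - p) ** (k - r)

k, p = 100, 0.5
mu = k * p                      # assumed centre of the limiting curve
sigma = sqrt(k * p * (1 - p))   # assumed spread of the limiting curve

largest_gap = max(
    abs(binomial_frequency(r, k, p) - normal_frequency(r, mu, sigma))
    for r in range(k + 1)
)
print(binomial_frequency(50, k, p), normal_frequency(50, mu, sigma))
print("largest difference over all classes:", largest_gap)
```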

Before considering the estimation of μ and σ it should be
pointed out that the normal curve is also the limit towards which
the distribution

$$(p_1+q_1)(p_2+q_2)(p_3+q_3)\ldots(p_k+q_k)$$

tends. This means that it has a very wide application in covering
the case where the k factors are exerting unequal influences on
the character in question. Nor does this close the account of
the usefulness of the normal curve, for it is the limit of the
multinomial distribution too. As might then be expected, many
sets of observed data have been found to conform to the normal
curve.

9. THE MEAN AND STANDARD DEVIATION 

The nature of the two quantities characteristic of the normal
curve, viz. μ and σ, can perhaps be understood most easily from
a geometrical representation. Fig. 1 shows a sample normal
curve which, as will be observed, is continuous and symmetrical,
unlike the binomial which is neither continuous nor in general
symmetrical. The shape is such that there is a point of maximum
slope on either side of the centre line, or axis of symmetry. The
distance, along the abscissa, of this centre line from the origin of
the axes is μ and the distance of each point of maximum slope
from the centre line, also as measured along the abscissa, is σ.
Thus μ fixes the position of the curve and σ determines its width.

It has already been stated that these two quantities are
seldom fixed by hypothesis, and that in consequence they must
usually be estimated from the experimental data themselves.
Three main ways of finding μ suggest themselves. Two of them
depend on this parameter's property of marking the centre of
the curve. The first consists of taking the simple average of all
the observations. This estimate of μ is termed the mean of the
distribution. The second way is to use as the estimate of μ the
magnitude of that observation which has an equal number of
others smaller and larger than itself. This is termed the median
of the distribution. The third way of finding μ depends on the
fact that it is this class which is most frequent, i.e. that the value
of m is greatest at this value of x in the graph. μ as found in
this way is the mode of the curve. These three characteristics
can all be used as estimates of μ in the case of the normal curve,
as they coincide, but in other curves, notably asymmetrical ones,
they have different values.
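
The three estimates can be compared on a small scale. The sketch below assumes Python's standard `statistics` module; both samples are invented for illustration and are not data from the text.

```python
import statistics

# Hypothetical symmetrical sample: the three estimates coincide.
symmetrical = [13, 14, 14, 15, 15, 15, 16, 16, 17]

# Hypothetical asymmetrical sample with a long upper tail.
skewed = [13, 14, 14, 14, 15, 16, 18, 20, 29]

for name, sample in (("symmetrical", symmetrical), ("skewed", skewed)):
    print(
        name,
        "mean", statistics.mean(sample),
        "median", statistics.median(sample),
        "mode", statistics.mode(sample),
    )
```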

Two ways of estimating σ have been proposed, and as might
be expected both depend on finding the average magnitude of
the deviations of the various observations from the centre of the
curve. The simplest way is to take the average of these deviations,
ignoring any question of sign, i.e. taking no account of
which side of the centre the particular observation lies. This
gives the mean deviation from which σ is estimated. The more
complex way consists of squaring all the deviations, so removing
any difficulty introduced by sign, finding the mean of these
squares and taking its square root. This is termed the standard
deviation, and is a direct estimate of σ.
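
The two measures of spread may be sketched as below, assuming Python. The sample is hypothetical, the divisor is simply the number of observations as in the description just given, and the factor √(π/2) used to turn a mean deviation into an estimate of σ is a standard property of the normal curve that the text does not state.

```python
from math import pi, sqrt

def mean_deviation(values):
    """Average deviation from the mean, ignoring sign."""
    centre = sum(values) / len(values)
    return sum(abs(v - centre) for v in values) / len(values)

def root_mean_square_deviation(values):
    """Square root of the mean of the squared deviations from the mean."""
    centre = sum(values) / len(values)
    return sqrt(sum((v - centre) ** 2 for v in values) / len(values))

sample = [13, 14, 14, 15, 15, 15, 16, 16, 17]  # hypothetical observations

md = mean_deviation(sample)
sd = root_mean_square_deviation(sample)

# For a normal curve the mean deviation is sigma * sqrt(2/pi), so sigma
# may be estimated from it by multiplying by sqrt(pi/2).
sigma_from_md = md * sqrt(pi / 2)
print(round(md, 4), round(sd, 4), round(sigma_from_md, 4))
```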







FIG. 1

In the centre, a normal distribution showing how the mean, μ, fixes the
position of the curve, and the standard deviation, σ, measures its spread. The
shaded portions are the tails cut off by the ordinates marking the deviation, d.
The four corners illustrate how curves showing Skewness and Kurtosis (solid line)
deviate from the normal distribution (broken line).


Now these various ways of estimating μ and σ would all lead
to the same value, viz. the true value, of each parameter if an
infinitely large number of observations were available; but it is
clear that with the limited samples obtainable in practice they
will differ among themselves. It is thus necessary to decide the
questions of which is the best estimate of μ, mean, median or
mode, and which gives a better estimate of σ, mean deviation or
standard deviation. Discussion of the considerations involved
in answering these questions must be deferred to a later chapter,
and it must suffice for the present to say that μ is most efficiently
estimated by the mean, and σ by the standard deviation.

The mean and standard deviation as found from observed 
data are only estimates of the true parameters of the curve. 
We have seen in the previous chapter how a sample of individuals 
drawn from a population whose characteristics are known can, 
and indeed must, as a result of the process of random sampling, 
show departures from these ideal features, realizable only 
with infinitely large numbers. Thus the estimated mean and 
standard deviation will not have the exact values of μ and σ.
They will be just as much subject to sampling error as the number
of fern-leaved plants in a segregating family. It is necessary,
therefore, to distinguish carefully between the parameters characterizing
the population from which our sample is drawn, and
the estimates of these parameters calculated from our limited
data. One way of emphasizing this distinction, suggested by
Fisher, is to denote the parameters by Greek letters and their
estimates by corresponding Latin letters. Thus the estimated
mean should be denoted by m, and the estimated standard
deviation by s. This practice will be followed as far as possible
throughout this book, but one great exception must be made.
It is, and has long been, customary to denote the mean of a series
of observed values, $x_1, x_2, \ldots, x_n$, by the symbol $\bar{x}$, and this
well-established custom will be followed, even though it conflicts
with Fisher's system of notation. The symbol m is thus released
for use in another connexion.

10. FITTING THE NORMAL CURVE 

Details of the arithmetical process of estimating the mean 
and standard error can best be illustrated by a numerical example. 

Example 2. In Table 5 are given the heights, in decimeters,
of 530 maize plants (Emerson and East's data). We have seen
that such a character as height is determined by the action of
a large number of different factors, just as the mean number of
individuals of a given kind in a segregating family is determined
by the joint behaviour of a number of factors, viz. the number
of individuals in the family, each of which may, or may not,
fall into the category in question. So each height measurement
corresponds to a family in the genetical examples of the
previous chapter. Now the chance of obtaining a family of a
given kind could be calculated from theory, but it could also
have been found experimentally by the process of raising a large
number of families of the size in question and observing how
many fell into the classes with 0, 1, &c., of the particular kind
of individual. The frequency distribution obtained in this way
is an estimate of the probability distribution of families of the
size in question. Just in the same way the frequency distribution
of the heights of the maize plants is, within the limits of sampling
error, a model of the probability distribution of such heights.
The observed frequency of a given class is an estimate of the
probability of obtaining a plant of that kind. The frequency
distribution may thus be used as material for the estimation of
the characteristics of the distribution of probabilities of plants
with the various heights.

TABLE 5

Distribution of the Heights of Maize Plants (in Decimeters) and the
Calculation of $\bar{x}$ and s, using a Working Mean (Emerson and East)

Height of plant    Deviation from the    Frequency
in dms. (x)        working mean          observed
[class centre]     ($x_w$ = 15 dms.)     (a)          $a(x-x_w)$    $a(x-x_w)^2$

      7                 -8                  1             -8             64
      8                 -7                  3            -21            147
      9                 -6                  4            -24            144
     10                 -5                 12            -60            300
     11                 -4                 25           -100            400
     12                 -3                 49           -147            441
     13                 -2                 68           -136            272
     14                 -1                 95            -95             95
     15                  0                 96              0              0
     16                  1                 78             78             78
     17                  2                 53            106            212
     18                  3                 26             78            234
     19                  4                 16             64            256
     20                  5                  3             15             75
     21                  6                  1              6             36

   Total                                  530           -244          2,754

$\bar{x} = 15 - \frac{244}{530} = 14.5396$ dms.

$S(x-\bar{x})^2 = 2{,}754 - \frac{244^2}{530} = 2{,}641.6679$

$N = 529$

$s^2 = 4.9936 - 0.0833$ [Sheppard's correction] $= 4.9103$

$s = \sqrt{s^2} = 2.2159$

In the data given in Table 5 the exact height of each plant
is not given, the plants being grouped into classes whose central
values are at 7, 8, 9, &c., decimeters. This can be done by
measuring to the nearest decimeter, or, at a later stage, by
grouping all plants with heights between 6.5 and 7.5 decimeters
into the 7-dm. class, and so on. This process is a great convenience
as it simplifies calculation. When finding the mean,
for example, instead of adding a large number of single values,
we can do the major part of the work by multiplying together
the frequency of each class and the central class value, subsequently
summing the relatively small number of products
obtained in this way. Tedious additions are replaced by rapid
multiplications. Grouping does, however, have certain undesirable
consequences. These follow from the fact that the
data are really being replaced by fictitious results when the
process of grouping is applied. In replacing the plants whose
heights fall between, say, 8.5 and 9.5 dms. by an equal number
of fictitious individuals, all of height 9 dms., the spread of the
frequency distribution is spuriously increased, since if the true
heights follow a normal curve the number of plants between
8.5 and 9 dms. is less than that between 9 and 9.5 dms. Thus
the average of the true plants is nearer to the centre of the curve
than is 9 dms. On the other side of the curve the fictitious
plants have too great a height, as the average of the actual
plants in the class, in being nearer to the centre, is lower than
the value assigned to the fictitious individuals.

Grouping does not bias our estimate of μ because, as will be
seen from the foregoing remarks, the discrepancies between the
fictitious plants and the ones they replace are of opposite sign
on opposite sides of the curve. The normal curve is symmetrical
and so the discrepancies cancel out. The estimate of the standard
deviation is, however, markedly biased by grouping, as the discrepancies
are all deviations from the mean and so will be summed
in the process of calculating this quantity. The estimate arrived
at from grouped data will thus be spuriously large. Fortunately,
due allowance can be made for this bias, as will be shown
below.
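
The effect of grouping may be illustrated by a small sampling experiment. This is a sketch in Python under stated assumptions: the heights are artificial, drawn from a normal curve with μ = 14.5 and σ = 2.2 chosen only to resemble the maize data, and the seed is arbitrary. The inflation of the variance is expected to be close to 1/12 of the square of the grouping unit.

```python
import random

def mean(values):
    return sum(values) / len(values)

def variance(values):
    """Mean of the squared deviations from the mean."""
    centre = mean(values)
    return sum((v - centre) ** 2 for v in values) / len(values)

random.seed(0)  # arbitrary seed, for repeatability only
true_heights = [random.gauss(14.5, 2.2) for _ in range(200_000)]

# Grouping into 1-dm. classes: each plant is replaced by its class centre.
grouped_heights = [round(h) for h in true_heights]

mean_shift = mean(grouped_heights) - mean(true_heights)
variance_inflation = variance(grouped_heights) - variance(true_heights)

print("shift in the mean:", mean_shift)
print("inflation of the variance:", variance_inflation, "against", 1 / 12)
```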

The best estimate of μ is the average plant height. It is
found, of course, by summing the heights of the plants and
dividing the total so obtained by the number of individuals
measured. This is expressed algebraically by the formula

$$\bar{x} = \frac{S(x)}{n}$$

where n is the number of individuals measured and
S means 'the sum of'. The obvious way of doing this in our
example is to find (7×1)+(8×3)+(9×4)+(10×12)+ . . . +(20×3)
+(21×1) and divide by 530. This gives $\bar{x} = \frac{7{,}706}{530} = 14.5396$ dms.
The numbers involved in this calculation are somewhat too large
to allow of easy manipulation unless a calculating machine is
used, so the device of the working mean is to be recommended.
This is usually most conveniently taken at the value of the class
with the greatest frequency, which in the present case is 15 dms.
The various classes are then regraduated, using the working mean
as the origin. Thus the 14-dm. class becomes -1, the 13-dm.
class -2, while 16 dms. and 17 dms. are replaced by 1 and
2 respectively (Column 2, Table 5). We then find (-8×1)+(-7×3)
+(-6×4)+ . . . +(4×16)+(5×3)+(6×1). The sum of these fourteen
quantities is -244, which when divided by 530 gives -0.4604.
Hence our estimate of the true mean is 15-0.4604 or 14.5396, as
found earlier by the heavier method.
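
Both routes to the mean can be followed in a few lines. The sketch below assumes Python; the class centres and frequencies are those of Table 5, while the variable names are illustrative.

```python
# Class centres (dms.) and observed frequencies from Table 5.
heights = list(range(7, 22))
frequencies = [1, 3, 4, 12, 25, 49, 68, 95, 96, 78, 53, 26, 16, 3, 1]

n = sum(frequencies)

# The heavier method: sum of frequency times class centre, divided by n.
direct_mean = sum(a * x for x, a in zip(heights, frequencies)) / n

# The working-mean method: regraduate about 15 dms. and correct afterwards.
working_mean = 15
sum_of_deviations = sum(a * (x - working_mean) for x, a in zip(heights, frequencies))
mean = working_mean + sum_of_deviations / n

print(n, sum_of_deviations, round(mean, 4), round(direct_mean, 4))
```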

The next operation is that of estimating the standard deviation,
for which purpose it is necessary to find the squared
deviations from the mean and to take the square root of their
average. Expressing this algebraically:

$$s = \sqrt{\frac{S(x-\bar{x})^2}{n-1}}$$

In order to find the sum of the squared deviations we could
regraduate all the classes in terms of their deviations from
14.5396. The 14-dm. class would become -0.5396, and the
13-dm. class -1.5396, while 15 dms. would be replaced by 0.4604
and 16 by 1.4604. The sum of squares of deviations is then

[(-7.5396)²×1]+[(-6.5396)²×3]+[(-5.5396)²×4]+ . . .
+[(4.4604)²×16]+[(5.4604)²×3]+[(6.4604)²×1]

This is an extremely cumbersome calculation and is never
attempted in practice, as the device of the working mean offers
an easier and more exact method. Let us denote the working
mean by $x_w$. The deviation of any x, say $x_i$, from the true
mean $\bar{x}$ can be expressed as the difference of two quantities,
one a deviation of $x_i$ from $x_w$ and the other the difference between
$x_w$ and $\bar{x}$. Thus

$$(x_i-\bar{x}) = (x_i-x_w)-(\bar{x}-x_w)$$

and

$$(x_i-\bar{x})^2 = (x_i-x_w)^2 - 2(x_i-x_w)(\bar{x}-x_w) + (\bar{x}-x_w)^2$$

Summing the n items of this kind, obtained by squaring each
of the deviations of the n different x values observed, we find

$$S(x-\bar{x})^2 = S(x-x_w)^2 - 2S[(x-x_w)(\bar{x}-x_w)] + S(\bar{x}-x_w)^2$$
$$= S(x-x_w)^2 - 2(\bar{x}-x_w)S(x-x_w) + n(\bar{x}-x_w)^2$$

as $\bar{x}$ and $x_w$ are constant.

But

$$n(\bar{x}-x_w) = n\bar{x} - nx_w = S(x) - nx_w \text{ (by definition of } \bar{x}\text{)} = S(x-x_w)$$

and so

$$S(x-\bar{x})^2 = S(x-x_w)^2 - \frac{2}{n}S^2(x-x_w) + \frac{1}{n}S^2(x-x_w)$$
$$= S(x-x_w)^2 - \frac{1}{n}S^2(x-x_w)$$

where $S^2(x-x_w)$ stands for 'the square of the sum of $(x-x_w)$'.

Thus we obtain the sum of squares of deviations from the 
true mean by first finding the sum of squares of deviations from 
the working mean and subtracting from it a quantity which is 
the squared sum of deviations from the working mean divided 
by the number of observations. 
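
The identity just derived can be confirmed on the maize data. The sketch below assumes Python and takes the class centres and frequencies from Table 5; the variable names are illustrative.

```python
# Class centres (dms.) and observed frequencies from Table 5.
heights = list(range(7, 22))
frequencies = [1, 3, 4, 12, 25, 49, 68, 95, 96, 78, 53, 26, 16, 3, 1]

n = sum(frequencies)
working_mean = 15

# Sums taken about the working mean, in whole numbers throughout.
sum_dev = sum(a * (x - working_mean) for x, a in zip(heights, frequencies))
sum_sq_working = sum(a * (x - working_mean) ** 2 for x, a in zip(heights, frequencies))

# Sum of squares about the true mean, by the working-mean identity.
sum_sq_true = sum_sq_working - sum_dev**2 / n

# The cumbersome direct calculation, for comparison.
true_mean = sum(a * x for x, a in zip(heights, frequencies)) / n
sum_sq_direct = sum(a * (x - true_mean) ** 2 for x, a in zip(heights, frequencies))

print(sum_sq_working, sum_dev, round(sum_sq_true, 4), round(sum_sq_direct, 4))
```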

In the maize example the working mean already used is at
$x_w$ = 15 dms. Then the sum of squares of deviations from this
working mean is easily found as

[(-8)²×1]+[(-7)²×3]+[(-6)²×4]+ . . . +[4²×16]+[5²×3]+[6²×1]

which is 2,754. The sum of deviations from the working mean
has earlier been found to be -244 and the number of observations,
n, is 530.

The sum of squares of deviations from the true mean is thus

$$2{,}754 - \frac{(-244)^2}{530} = 2{,}754 - 112.3321 = 2{,}641.6679$$

The square of the standard deviation is then

2,641.6679 ÷ 529 = 4.9936

It was pointed out earlier that grouping of the data would
lead to a spuriously high value for the standard error. Sheppard's
correction for the error of grouping may be applied to rectify
this, before taking the square root; it consists of subtracting
1/12 of the square of the unit of grouping. In the present case the grouping
unit is 1 dm. and so 0.0833 is subtracted from 4.9936 to leave
4.9103 as the square of the standard deviation, which is itself
then found to be 2.2159.
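
The last steps of the calculation may be sketched as follows, assuming Python; the function name is hypothetical. Carried to full precision the uncorrected variance comes to about 4.9937, agreeing with the figure in the text to within a unit in the last place.

```python
from math import sqrt

def corrected_standard_deviation(sum_sq_working, sum_dev, n, grouping_unit):
    """Standard deviation from sums about a working mean, with Sheppard's
    correction of 1/12 of the squared grouping unit applied to the variance."""
    sum_sq_true = sum_sq_working - sum_dev**2 / n
    variance = sum_sq_true / (n - 1)          # divisor is n-1
    corrected = variance - grouping_unit**2 / 12
    return variance, corrected, sqrt(corrected)

# Totals from Table 5; the grouping unit is 1 dm.
variance, corrected, s = corrected_standard_deviation(2754, -244, 530, 1)
print(round(variance, 4), round(corrected, 4), round(s, 4))
```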

The square of the standard deviation, which it will be noted 
must always be found as part of the calculation of the standard 
deviation, is termed the variance (V), or, when estimated, the
mean square. The mean and standard deviation are linear
quantities expressible in our example in decimeters, but the
variance is a quadratic quantity expressible in square decimeters.

In obtaining the variance, the sum of squares, 2,641.6679,
was divided not by n, but by n-1, i.e. 529, which is termed the
number of degrees of freedom (N). The choice of n-1 instead
of n as the divisor can be justified by a simple consideration.
The standard deviation and variance are measures of the ‘spread’
of the curve and as such are based on the magnitudes of
differences. The greater the spread of the curve the greater will be
the average difference between the individual x values and the
mean; hence the greater also will be the average difference
between any two individual x values. In other words, at least
two values are required to afford any idea of the spread of the
curve and any two values will supply but one difference. Just
as two values give one difference, three values give two
independent differences and in general n values give n-1 independent
differences. If the mean is fixed by hypothesis the difference
between this theoretical mean and that actually observed will
contribute an nth difference. But in our example the mean
was not fixed by hypothesis, and so the n observations together
contribute only n-1 independent differences, or degrees of
freedom, to the estimation of the standard error.

This justification of the use of n-1 as the divisor may be
completed by considering the case of only one observation. If
the mean were fixed by hypothesis this single observation would
supply one difference for the estimation of $\sigma$. Where, however,
the mean must be calculated from the data, as it must in the
majority of instances, the observation is itself the best estimate
of the mean and hence the deviation of observation from the
mean will inevitably be 0. If n-1 is used as the divisor the
variance is found to be $\frac{0}{0}$, which is indeterminate, a very
reasonable result. But if n is used as the divisor the variance would
be $\frac{0}{1}$, which is clearly ridiculous. It will be observed that it is
the fitting of the mean which reduces the number of degrees of
freedom from n to n-1. The principle that a degree of freedom
is lost every time a parameter is estimated is highly important
and will be considered in detail later.
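The effect of the divisor can also be seen by complete enumeration. The sketch below is an illustration only: it assumes a small hypothetical population of six values and draws every possible sample of three, with replacement, comparing the average of the mean squares found with each divisor against the variance of the population itself.

```python
from fractions import Fraction
from itertools import product


def mean_square(sample, divisor):
    """Sum of squares about the sample's own mean, divided by `divisor`."""
    m = Fraction(sum(sample), len(sample))
    return sum((x - m) ** 2 for x in sample) / divisor


population = [1, 2, 3, 4, 5, 6]    # hypothetical population
size = 3                           # observations per sample

mu = Fraction(sum(population), len(population))
true_variance = sum((x - mu) ** 2 for x in population) / len(population)

# Every possible sample of `size` values drawn with replacement.
samples = list(product(population, repeat=size))
avg_n_minus_1 = sum(mean_square(s, size - 1) for s in samples) / len(samples)
avg_n = sum(mean_square(s, size) for s in samples) / len(samples)

print(true_variance, avg_n_minus_1, avg_n)
```

With n-1 as the divisor the estimates average exactly to the population variance; with n they fall short by the factor (n-1)/n.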

Reverting for a moment to the use of the working mean in
finding the sum of squares of deviations from the true mean, it
must be emphasized that the working mean need not necessarily
be chosen near to the true mean. It has been shown that the
true mean can be found when the working mean is taken at
$x_w = 0$ and the sum of squares can be found by using the same
value. The appropriate formula then reduces to

$$S(x-\bar{x})^2 = S(x^2) - \frac{1}{n}S^2(x)$$





In the present example this calculation is made as follows:

$$S(x^2) = (7^2 \times 1) + (8^2 \times 3) + (9^2 \times 4) + \dots + (19^2 \times 16) + (20^2 \times 3) + (21^2 \times 1) = 114{,}684.$$

$S(x)$ has earlier been found to be 7,706, so

$$S(x-\bar{x})^2 = 114{,}684 - \frac{7{,}706^2}{530} = 114{,}684 - 112{,}042.3321 = 2{,}641.6679$$

as before.

When $x_w = 0$ the calculation is heavy, and so unless a calculating
machine is available it is better to take a value near to the true
mean. For machine calculation, however, the value of 0 is
preferable, since large numbers present little difficulty and the
labour of finding deviations from the working mean is obviated.
The use of a working mean involves no approximation in
finding the sum of squares. It leads to a fully accurate value.
In fact, the results obtained in this way are often more accurate
than those obtained by squaring and summing deviations from
the true mean, because the latter process involves multiplication
of indefinite decimal fractions, whereas the working mean
necessitates only the squaring of whole numbers.

FIG. 2

Frequency histogram of Emerson and East’s maize heights and the normal
distribution fitted to them

Having estimated $\mu$ and $\sigma$, their numerical values may be
substituted in the general formula of the normal curve. This
gives the particular expression which best fits the maize data.
The formula obtained in this way is

$$\frac{1}{2.2159\sqrt{2\pi}}\, e^{-\frac{(x-14.5396)^2}{2 \times 2.2159^2}}$$

and the corresponding curve is plotted in Fig. 2 together with
the original data. It will be seen that the agreement between
observation and expectation is quite good.
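The fitted curve can be evaluated at each height class to give approximate expected frequencies. The following is a sketch only: it assumes that the expected number in a class of width 1 dm. may be taken as 530 times the ordinate at the class centre, which is an approximation to the area of the class and not necessarily the method used for Fig. 2; the function names are illustrative.

```python
import math

MEAN, SD, N_PLANTS = 14.5396, 2.2159, 530


def normal_density(x, mean=MEAN, sd=SD):
    """Ordinate of the normal curve with the given mean and standard deviation."""
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))


def expected_frequency(x, n=N_PLANTS, class_width=1.0):
    """Approximate expected count for the class centred at x (ordinate times width)."""
    return n * class_width * normal_density(x)


# Heights from 7 to 21 dms, the range covered by the maize data.
expected = {height: expected_frequency(height) for height in range(7, 22)}
for height, value in expected.items():
    print(height, round(value, 2))
```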

11. SKEWNESS AND KURTOSIS

Fitting the normal curve depends on calculating $\mu$ and $\sigma$,
which are termed respectively the first and second moments of
the data. Higher moments can be calculated, in particular the
third and fourth which depend, as might be expected, on $S(x-\bar{x})^3$
and $S(x-\bar{x})^4$ respectively.

The third moment is of interest in that it affords a test of 
the skewness of the distribution. Where this is symmetrical, as 
in the case of the normal curve, the third moment is 0. A positive 
value of the sum of cubes of deviations indicates that the curve 
has a long tail to the right and its mode to the left of the mean. 
This is positive skewness. When the long tail is to the left and 
the mode to the right, the sum of cubes is negative and the 
skewness is said to be negative. 

A curve may be symmetrical and yet not normal, in that it 
shows kurtosis, which means that, relative to the shoulders, the 
centre and tails have too many or too few values for normality. 
Kurtosis is detected by the use of the fourth moment. Negative
kurtosis implies that the shoulders are excessive and
positive kurtosis that the centre and tails contain too many 
observations. Examples of such curves are given in Fig. 1. 
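The signs described above are easily seen on small sets of figures. The sketch below uses invented data; the ratio $nS(x-\bar{x})^4 / [S(x-\bar{x})^2]^2$, which takes the value 3 for a normal curve, is introduced here as one common way of summarizing the fourth moment and is an assumption of the example, not a statistic defined in the text.

```python
def moment_sums(values):
    """Return S(x - mean)^2, S(x - mean)^3 and S(x - mean)^4."""
    n = len(values)
    mean = sum(values) / n
    return tuple(sum((x - mean) ** power for x in values) for power in (2, 3, 4))


def kurtosis_ratio(values):
    """n * S4 / S2^2; about 3 for a normal curve, lower when the shoulders are heavy."""
    n = len(values)
    s2, _, s4 = moment_sums(values)
    return n * s4 / s2 ** 2


# Hypothetical observations (illustrative only).
symmetric = [1, 2, 2, 3, 3, 3, 4, 4, 5]
right_tailed = [1, 1, 1, 2, 2, 3, 8]
left_tailed = [9 - x for x in right_tailed]   # mirror image of right_tailed
flat = [1, 2, 3, 4, 5]

cubes_symmetric = moment_sums(symmetric)[1]
cubes_right = moment_sums(right_tailed)[1]
cubes_left = moment_sums(left_tailed)[1]
flat_ratio = kurtosis_ratio(flat)
print(cubes_symmetric, cubes_right, cubes_left, flat_ratio)
```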

In practice, moments of order higher than the second are not 
often used. It is occasionally necessary to test for skewness by
using third-degree statistics, but such needs are rare. Kurtosis 
is very seldom considered. Practically the whole of present-day 
statistical analysis depends on the comparison of variances and 
co-variances, i.e. on the use of second-degree statistics. It is 
nearly always assumed that the data fall on a normal distribution, 
and, though the fact that skew curves are not too uncommon 
shows that this assumption is not generally justified, the errors 
introduced into second-degree calculations by such departures 
from normality are small and may be neglected. Actual data 
have been analysed by Eden and Yates and the insignificance of 
these errors has been fully demonstrated. 

12. THE POISSON SERIES 

A binomial distribution approaches normality as k in the
formula $(p+q)^k$ increases, provided that both p and q are of an
order larger than $\frac{1}{k}$. Where, however, p or q is of the magnitude
$\frac{1}{k}$ the limit approached by the binomial as k increases is not the
continuous normal curve but the discontinuous Poisson series,
whose formula is

$$e^{-\mu}\,\frac{\mu^x}{x!},$$

where $\mu$ is the mean. The chief characteristic of this distribution
is that its variance equals its mean, i.e. $\sigma^2 = \mu$.
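The approach of the binomial to the Poisson series can be watched numerically. In the sketch below the mean is held at a hypothetical value of 2 while k is increased, so that p = 2/k is always of the magnitude 1/k; the largest discrepancy between corresponding terms is recorded for each k.

```python
import math


def binomial_term(x, k, p):
    """Probability of x occurrences in k trials, from the expansion of (p + q)^k."""
    return math.comb(k, x) * p ** x * (1 - p) ** (k - x)


def poisson_term(x, mu):
    """Term of the Poisson series for x occurrences when the mean is mu."""
    return math.exp(-mu) * mu ** x / math.factorial(x)


mu = 2.0   # hypothetical mean, held fixed as k grows


def largest_gap(k):
    """Largest absolute difference between the two series over x = 0..10."""
    return max(
        abs(binomial_term(x, k, mu / k) - poisson_term(x, mu)) for x in range(11)
    )


gaps = {k: largest_gap(k) for k in (10, 100, 1000)}
for k, gap in gaps.items():
    print(k, gap)
```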

Data on the frequency of a given event are expected to fall 
on a Poisson series when the probability of the event happening 
on any one occasion is very small but when the number of 
occasions is sufficiently large for the event to be observed not 
too infrequently. One example, quoted by R. A. Fisher, is that 
of deaths of soldiers as a result of mule kicks. The chance of
any one soldier being killed in any one year in this way is small, 
but where a whole army corps is exposed to the risk a number 
will sometimes be killed. Bortkewitch records that when 
10 Prussian corps were exposed to mule kicks over a period of 
20 years, the following frequency distribution of deaths per 
corps per year was obtained : 

Deaths per corps per year      0     1     2     3     4    Total

Frequency                    109    65    22     3     1     200
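Whether these figures show the chief characteristic of a Poisson series may be judged by estimating the mean and variance with a working mean of 0, as was done for the maize heights. The sketch below uses the frequencies just tabulated; the variable names are illustrative.

```python
deaths = [0, 1, 2, 3, 4]
frequency = [109, 65, 22, 3, 1]

n = sum(frequency)                                            # corps-years
sum_x = sum(a * x for x, a in zip(deaths, frequency))         # S(x)
sum_x_sq = sum(a * x * x for x, a in zip(deaths, frequency))  # S(x^2)

mean = sum_x / n
variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)   # n - 1 degrees of freedom

print(n, mean, round(variance, 4))
```

The two estimates lie very close together, as a Poisson series requires.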

Various biological problems involve the use of this distribution, 
notably those arising when estimating the density of cells or 
organisms by haemacytometer counts or by the plating method. 
The same situation is encountered when measuring the density 
of plants in the wild by the use of quadrat counts. If a quadrat 
of appropriate size is used the chance that any one plant in 
a given one of its squares will be of the species under investigation
is small, but a sufficient number of plants are included to
ensure that a small number of this species are in fact found. 

Example 3. The following data on the density of Eryngium
maritimum were collected by Blackman. In all, 147 counts
were made and 0, 1, 2, 3, &c., plants of this species were found
in a square of the quadrat with the frequencies shown in Table 6.
The table also includes details of the estimation of the mean
and variance using a working mean of 0. It will be seen that
$\bar{x} = \frac{293}{147} = 1.9931$. The sum of squares of deviations from the mean
is 851 - 584.0068 = 266.9932, and since there are 146 degrees of
freedom or independent differences on which the sum of squares
is based

$$s^2 = \frac{266.9932}{146} = 1.8287$$

Since the data were not artificially grouped, Sheppard’s correction
is not used.




TABLE 6

The density of Eryngium maritimum (Blackman’s data)
Number of Plants per Quadrat Square

Number of plants per   Frequency                    Frequency
quadrat square (x)        (a)        ax      ax²    expected (mn)

0                          16         0        0        20.03
1                          41        41       41        39.93
2                          49        98      196        39.79
3                          20        60      180        26.44
4                          14        56      224        13.17
5                           5        25      125         5.25
6                           1         6       36         1.74
7                           1         7       49         0.50
8 and over                  0         0        0         0.15

Total                     147       293      851       147.00

$$\bar{x} = \frac{293}{147} = 1.9931$$
$$S(x-\bar{x})^2 = 851 - \frac{293^2}{147} = 266.9932$$
$$N = 146$$
$$V = s^2 = \frac{S(x-\bar{x})^2}{N} = 1.8287$$


The estimates of mean and variance agree reasonably well. 
Perfect agreement is not to be expected, for though the true
mean and true variance are equal, sampling error will cause 
slight discrepancies between their estimated values. 



FIG. 3

Frequency polygons of the number of Eryngium plants observed by Blackman 
to fall into quadrat squares (solid line) and the Poisson series fitted to the data 
(broken line) 





The numerical value found for $\bar{x}$ can be substituted in the
general formula and gives as the expected proportions

$$e^{-1.9931},\quad 1.9931\,e^{-1.9931},\quad \frac{1.9931^2}{2!}\,e^{-1.9931},\quad \frac{1.9931^3}{3!}\,e^{-1.9931},\ \dots$$

for the classes 0, 1, 2, 3, &c. Multiplied by the 147 counts made,
these give the expected frequencies. These expectations are given in
Table 6, and are also plotted together with the observed values
in Fig. 3. The close agreement of the data with a Poisson
distribution is well illustrated by both table and figure.
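The expected column of Table 6 can be regenerated from the observed frequencies alone. The sketch below is one way of arranging the work, not necessarily that used for the table: each term is obtained from the one before by multiplying by the mean and dividing by the class number, and the class ‘8 and over’ is found by difference from the total. Small differences in the second decimal place from the printed table are to be expected, since the text rounds the mean to 1.9931.

```python
import math

# Observed frequencies for 0, 1, ..., 7 plants per quadrat square (Table 6).
observed = [16, 41, 49, 20, 14, 5, 1, 1]

n = sum(observed)
mean = sum(x * a for x, a in enumerate(observed)) / n


def poisson_expected(mean, n, last_class):
    """Expected frequencies for classes 0..last_class of a Poisson series."""
    terms = [n * math.exp(-mean)]
    for x in range(1, last_class + 1):
        terms.append(terms[-1] * mean / x)   # term x from term x - 1
    return terms


expected = poisson_expected(mean, n, last_class=7)
tail = n - sum(expected)                     # the class '8 and over'

for x, (a, e) in enumerate(zip(observed, expected)):
    print(x, a, round(e, 2))
print("8 and over", 0, round(tail, 2))
```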


13. THE MEAN AND VARIANCE OF THE BINOMIAL DISTRIBUTION 

In both the Normal and Poisson distributions the mean and
variance are usually estimated from the data themselves. The
necessity for this arises from the fact that these distributions
are used to replace binomials when information about p and k is
lacking. Now clearly if the values of these latter quantities
are given the distribution is known exactly and there is no need
to estimate $\mu$ and $\sigma$ from the data. The mean and variance
would in fact be calculable from p and k, irrespective of the
magnitude of the latter, i.e. whether a normal distribution could
be assumed or not. The appropriate formulae are:

$$\mu = p$$

and

$$\sigma^2 = \frac{pq}{k}$$

These formulae give the theoretical values of the parameters 
and do not merely lead to estimates of them. Estimates could 
of course be made from the observed data, but they would not 
be so valuable as the values expected, unless it could be shown 
that the data did not agree with expectation, in which case the 
initial premise that p and k are known is invalid and resort 
must be made to some form of estimation. Even in such cases 
the proper approach would not always be by the estimation of 
the mean and variance, but rather by that of p and k. This 
will be discussed in more detail in a later chapter. The following
example will, however, serve to illustrate the difference between 
parameters and their estimates. 

Example 4. Fisher and Mather have described a genetical
experiment with mice which consisted of a backcross in which
the gene Wv, wv (straight-wavy hair) was segregating. The
females were all wavy (wv wv) and the males heterozygous
straight (Wv wv). Hence by mendelian theory each mouse in
their offspring has a half chance of being straight haired and a
half chance of being wavy. So in litters of 8 mice it is expected
that 8 straight and 0 wavy, 7 straight and 1 wavy, &c., will be
found with relative frequencies given by the expansion of
$(\frac{1}{2}+\frac{1}{2})^8$, i.e. $p = q = \frac{1}{2}$ and $k = 8$. The theoretical mean
proportion of straight-haired mice, $\mu$, is equal to p, i.e. is $\frac{1}{2}$.
The variance of the distribution is $\frac{pq}{k}$, i.e. $\frac{1}{2} \times \frac{1}{2} \times \frac{1}{8}$,
which is $\frac{1}{32}$ or 0.03125. The standard deviation is thus
$\sqrt{0.03125}$ or 0.1768. But we may
also estimate the mean and variance from the actual experimental
data.
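The theoretical values can be confirmed by summing over the terms of the binomial expansion directly. The sketch below works in exact fractions; the function name is illustrative, and the second pair of values (p = 1/4, k = 5) is a hypothetical case added only to show that the agreement does not depend on p being one half.

```python
from fractions import Fraction
from math import comb


def binomial_proportion_moments(p, k):
    """Mean and variance of the proportion r/k, summed over the expansion of (p + q)^k."""
    q = 1 - p
    terms = {Fraction(r, k): comb(k, r) * p ** r * q ** (k - r) for r in range(k + 1)}
    mean = sum(x * prob for x, prob in terms.items())
    variance = sum((x - mean) ** 2 * prob for x, prob in terms.items())
    return mean, variance


mean, variance = binomial_proportion_moments(Fraction(1, 2), 8)
other_mean, other_variance = binomial_proportion_moments(Fraction(1, 4), 5)
print(mean, variance, float(variance) ** 0.5)
```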


TABLE 7

Segregation for Straight-Wavy in Litters of Eight Mice (Fisher and Mather)

Proportion of     Frequency                           Number of
straight-haired   observed                            straight-haired
mice (x)          (a)          ax         ax²         mice (x)          ax     ax²

0.000              0          0.000     0.000000      0                  0       0
0.125              1          0.125     0.015625      1                  1       1
0.250              2          0.500     0.125000      2                  4       8
0.375              4          1.500     0.562500      3                 12      36
0.500             12          6.000     3.000000      4                 48     192
0.625              6          3.750     2.343750      5                 30     150
0.750              5          3.750     2.812500      6                 30     180
0.875              2          1.750     1.531250      7                 14      98
1.000              0          0.000     0.000000      8                  0       0

Total             32         17.375    10.390625                       139     665



Table 7 gives the segregation observed in 32 such litters.
The type of litter is denoted by the proportion, x, of
straight-haired mice that it contains. Thus a litter with seven straight
and one wavy has 0.875 of its mice straight. The second column
shows the observed frequency (a) of such litters. The third and
fourth columns show ax and $ax^2$, which are necessary for the
calculation of the estimates. The working mean is taken at 0 for
this purpose. $S(x)$ is 17.375, so the estimate of $\mu$, viz. $\bar{x}$, is
0.542969. The sum of squares of deviations from the mean is
10.390625 - 9.434082, which reduces to 0.956543. Since there are
32 observations the sum of squares is based on 31 degrees of
freedom, so the estimate of the variance, $s^2$, is 0.03086, and, by
taking the square root, the standard deviation is estimated at
0.1757.

The estimates of the mean and the standard error differ from 
their theoretical values, though the deviations are not large. 
This is to be expected, since the theoretical values are based on 
the properties of a hypothetical population including an infinite 
number of litters, whereas the experimental data comprised only 
32 litters. The estimates are thus subject to sampling errors. 
A closer approximation to the expected values would be anticipated
if more families were used as material for estimation.

A further important point may be illustrated by these data.
So far the measure of x used has been the proportion of
straight-haired mice in the litter. Instead, we might equally well have
used the number of such mice. The corresponding estimations
of the mean and variance are carried out in the second part of
Table 7, and we find that $\bar{x}$ is now 4.34375, while $s^2$ is 1.9748.
These are respectively 8 and 64 times as big as the values obtained
when the proportion of mice was used as the measure. The
standard deviation will, of course, be 8 times the earlier value,
just as was the mean. It is, in fact, a general rule that if the
scale of x is increased by a factor C the mean and standard
deviation will be multiplied by C and the variance by $C^2$.
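Both sets of estimates, and the rule of scale connecting them, can be reproduced together. In the sketch below the litter frequencies are an assumption of the example: they are the values implied by the $ax^2$ entries of Table 7 (each entry divided by the corresponding $x^2$), with the two extreme classes taken as empty so that the total is 32.

```python
# Litters with 0, 1, ..., 8 straight-haired mice (frequencies inferred from Table 7).
litters = [0, 1, 2, 4, 12, 6, 5, 2, 0]


def estimates(values, freqs):
    """Mean and variance from a frequency table, using a working mean of 0."""
    n = sum(freqs)
    sum_x = sum(a * x for x, a in zip(values, freqs))
    sum_x_sq = sum(a * x * x for x, a in zip(values, freqs))
    mean = sum_x / n
    variance = (sum_x_sq - sum_x ** 2 / n) / (n - 1)
    return mean, variance


numbers = list(range(9))                  # x as the number of mice
proportions = [r / 8 for r in numbers]    # x as the proportion of mice

mean_p, var_p = estimates(proportions, litters)
mean_n, var_n = estimates(numbers, litters)

print(round(mean_p, 6), round(var_p, 5), round(var_p ** 0.5, 4))
print(mean_n, round(var_n, 4), mean_n / mean_p, var_n / var_p)
```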

The same rule holds good for the expected mean and variance.
Thus taking x as the number of straight-haired mice in the litter
is equivalent to using the expansion of $(4+4)^8$. Then $\mu$ is equal
to 4 and $\sigma^2$ is $\frac{4 \times 4}{8} = 2$. It will be observed that these values are
respectively k and $k^2$ times as big as the figures obtained earlier.
The formulae can then be written in terms of the original p and
k as $\mu = pk$ and $\sigma^2 = pqk$. These formulae are appropriate to the
analysis of the observed numbers of individuals in the various
classes, while the earlier ones are used for the analysis of the
proportions of individuals observed to fall into the two groups.

REFERENCES 

BLACKMAN, G. E. 1935. A study by statistical methods of the
    distribution of species in grassland associations. Ann. Bot., 49, 749-77.
EDEN, T., and YATES, F. 1933. On the validity of Fisher’s z test when
    applied to an actual example of non-normal data. J. Agr. Sci.,
    33, 6-17.
EMERSON, R. A., and EAST, E. M. 1913. Inheritance of quantitative
    characters in maize. Neb. Exp. Sta. Res. Bul. 2.
FISHER, R. A., and MATHER, K. 1936. A linkage test with mice. Ann.
    Eugen., 7, 265-80.



CHAPTER IV


TESTS OF SIGNIFICANCE 

14. THE NORMAL DEVIATE 

ANY observed quantity is likely to deviate from its expected
value for either of two reasons. In the first place it may deviate 
because of sampling error arising from the necessary use of 
finite numbers of observations; and, secondly, it may depart 
from expectation because the premises on which the expectation 
is based are invalid. The purpose of a test of significance is to 
effect a distinction between these two sorts of deviation. 

We have already seen, in Section 6, how such a test of
significance works in the case of observations relating to an hypothesis
on the basis of which probabilities are obtained by the expansion
of a binomial expression. The operations involved are the same
for all tests of significance. It is first necessary to determine,
from the hypothesis to be tested, the distribution of probabilities
of obtaining the various observable values; and, secondly, to
assess from this distribution the probability of obtaining, as a
result of sampling error, as bad or worse fit with hypothesis than
that shown by the observations in question. If this probability is
sufficiently small it is then assumed that the hypothesis is
incapable of explaining the data, or, in other words, that the deviation
of observation from expectation is significant.

In the genetical example of Section 6 it was easy to evaluate 
the probabilities of all possible types of family. It was then 
possible to arrive at the probability of obtaining as bad or worse 
fit by the simple process of picking out all the types of family 
which fell into this category and summing their probabilities. 

It is, however, obvious that this technique is applicable only 
when the number of observable types is limited. Where, for 
example, the observation in question is one concerning a character
such as plant height, which by hypothesis can have any
value between wide limits, the probabilities of the various heights 
following a normal curve, a different approach is necessary. 

In Example 2 it has been shown that the frequency distribution
of heights of 530 maize plants agrees well with a normal
curve of mean 14.5396 and standard deviation 2.2159. From
this we infer that the probability (m) that any plant of the
population from which these 530 were taken will have a height of
x dms. is given by the formula

$$m = \frac{1}{2.2159\sqrt{2\pi}}\, e^{-\frac{(x-14.5396)^2}{2 \times 2.2159^2}}$$

or, more precisely, the probability dm of any plant having
a height falling within the small range dx is

$$dm = \frac{1}{2.2159\sqrt{2\pi}}\, e^{-\frac{(x-14.5396)^2}{2 \times 2.2159^2}}\, dx$$

This probability is greatest when x = 14.5396 and gets progressively
smaller as x - 14.5396, which we may denote as d, increases,
irrespective of sign. Thus the best agreement with the hypothesis
that a given plant is a member of this population is obtained
when the plant in question is 14.5396 dms. tall, and the fit gets
worse as the plant is found to be further from this height. So in
order to find the probability of obtaining at least as bad a fit
with hypothesis it is necessary to sum the probabilities of all
heights which deviate from the mean by at least d dms. These
probabilities together constitute the two tails of the distribution,
as shown in Fig. 1.

The normal curve is continuous. So it is impossible to find
the probability which the tails of the curve represent by summing
the individual probabilities of the various possible heights
included in these tails, because the number of such heights is
theoretically infinite. What must be done is to find the area
of the tails by integrating the curve from $-\infty$ to $14.5396 - d$ and
from $14.5396 + d$ to $+\infty$. Since the curve is symmetrical this is
the same as finding twice the integral from $-\infty$ to $14.5396 - d$ (see
Fig. 1). The integration is itself a cumbersome operation and
need not be carried through here. It results in the finding that
the area of the tails cut off by lines at the distance d from
the mean is a function of $\frac{d}{\sigma}$. Thus for any normal curve the
deviation $c\sigma$ has a probability dependent only on c, which is
called the normal deviate. Tables have been prepared relating
this probability to the value of c. So in order to determine the
significance of any deviation, d, it is only necessary to find
$c = \frac{d}{\sigma}$ and look up the corresponding probability in a table of
normal deviates such as that given in Table I at the end of this
book. An example of the calculation will be given later.
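Where a table is not to hand, the area of the two tails can be had from the complementary error function, which is a standard route to the normal integral. The sketch below assumes the fitted maize curve of the text and a hypothetical plant 19 dms. tall; the function names are illustrative.

```python
import math


def normal_deviate(deviation, sd):
    """c = d / sigma."""
    return deviation / sd


def two_tailed_probability(c):
    """Area of both tails of the normal curve beyond a deviate of size |c|."""
    return math.erfc(abs(c) / math.sqrt(2))


MEAN, SD = 14.5396, 2.2159     # fitted maize curve
height = 19.0                  # hypothetical plant, in dms

c = normal_deviate(height - MEAN, SD)
p = two_tailed_probability(c)
print(round(c, 4), round(p, 4))
```

A deviate of 1.96 cuts off tails of about 0.05 in all, which is the familiar 5 per cent. level.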


15. THE t DISTRIBUTION

The normal deviate, c, is defined as the ratio of the deviation,
d, to the standard deviation, $\sigma$. But, as we have already seen,
the exact value of $\sigma$ is not always known. The best figure which
is available is most commonly an estimate, s, of $\sigma$. Now the
estimate s is itself subject to sampling error and will depart more
or less widely from the true standard deviation, the accuracy of
the estimate increasing as the number of degrees of freedom, on
which it is based, increases. When this number of degrees of
freedom is reasonably large, say thirty or more, the error to
which s is subject becomes sufficiently small to be neglected
with safety. In such cases s may be used to replace $\sigma$ without
serious inaccuracy resulting. But if s is estimated from a small
number of comparisons, its departure from $\sigma$ may be very wide,
and to use it in the calculation of a normal deviate would be to
invite serious trouble. The test of significance in such cases
requires a knowledge of the distribution of $\frac{d}{s}$ rather than $\frac{d}{\sigma}$.
This distribution was calculated in 1908 by ‘Student’ and is
known as the t distribution. It is unnecessary to give the
details of its derivation, but its formula is quite a simple one.

$$dm = \frac{\left(\frac{N-1}{2}\right)!}{\left(\frac{N-2}{2}\right)!\,\sqrt{\pi N}}\left(1+\frac{t^2}{N}\right)^{-\frac{N+1}{2}} dt$$



Two points should be noted concerning this quantity. In
the first place it is expressed solely in terms of t which is found
as d/s. $\sigma$ does not enter into the formula, and so t is a quantity
exactly calculable from observational data and involves no
assumptions about the unknown parameter. In other words, it
affords an exact test of significance for use when the true standard
deviation is unknown. Secondly, the t distribution takes into
account the number of degrees of freedom (N) * on which the
estimate s is based. This is to be expected as the sampling
error of s depends on N. Table II at the end of the book shows
t tabulated against the probability of getting at least as bad
a fit (i.e. as big a value for t) as the one observed. It will be seen
that the table of t is two-dimensional where the normal deviate
was tabulated in one dimension. The second dimension in the
case of t is N, which must be known before any value can be
entered in the table. Thus to use t we must find d, s and N.
t is the ratio of d to s and is entered in Table II in the row
corresponding to whatever value is found for N. Any given level of
probability is reached by smaller and smaller values of t as
N increases.
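The dependence of the probability on N can be examined without a table by integrating the t curve numerically. The sketch below is an illustration under stated assumptions: it uses the standard density of Student's t, written with the gamma function in place of factorials, and Simpson's rule with an arbitrarily chosen 2,000 steps; it is not the method by which Table II was computed.

```python
import math


def t_density(t, N):
    """Ordinate of Student's t curve for N degrees of freedom."""
    log_const = (
        math.lgamma((N + 1) / 2) - math.lgamma(N / 2) - 0.5 * math.log(math.pi * N)
    )
    return math.exp(log_const - (N + 1) / 2 * math.log1p(t * t / N))


def t_two_tailed(t, N, steps=2000):
    """Probability of a value at least as large as |t|, by Simpson's rule."""
    t = abs(t)
    if t == 0:
        return 1.0
    h = t / steps                       # steps must be even for Simpson's rule
    total = t_density(0.0, N) + t_density(t, N)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, N)
    central_area = 2 * total * h / 3    # area between -t and +t
    return 1 - central_area


# The same value of t is less and less probable as N increases.
for N in (1, 5, 10, 30, 1000):
    print(N, round(t_two_tailed(2.0, N), 4))
```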

Since the sampling error of s becomes zero as N becomes
infinitely large, t reaches the normal deviate, c, as a limit under
these circumstances. So the normal deviate is a special example
of t and may be tabulated in a single dimension solely because
it is really a single row of the two-dimensional t table.

* N will be used to denote the number of degrees of freedom, in order
to avoid confusion with the sample size, n. It should, however, be noted
that Fisher frequently uses n for the number of degrees of freedom. Fisher
and Yates also follow this practice in their book of tables.


16. THE z DISTRIBUTION


The use of both t and c is limited to cases where the
significance of a single deviation is in question. It is, however, clear
that such cases are only special examples of the general situation
in which the joint significance of a number of deviations is at
issue. The appropriate general tests can be derived from a
consideration of the nature of c and t.

Consider for a moment a t for one degree of freedom. The
numerator, d, is a single deviation or difference. The denominator
s is a standard deviation based on one degree of freedom, i.e. is
also a single difference. The two parts of the fraction are exactly
similar in nature, and so the numerator can be considered as a
standard deviation based on one degree of freedom. As the
number of degrees of freedom of t increases, the denominator
becomes the root mean square of a number of differences, though
all t's have a single difference as their numerator. There is,
however, no reason why the numerator should not be a root mean
square just as the denominator can be, and, as such, be based
on any number of degrees of freedom. Such a ratio of two root
mean squares, characterized by two numbers of degrees of
freedom, $N_1$, that of the numerator, and $N_2$, that of the
denominator, is the generalization of t.

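This general distribution was first found by R. A. Fisher,
who, however, did not use the ratio of the two root mean squares
as his variable, but chose instead the natural logarithm of this
ratio. This he termed z, and the formula of his z distribution is

$$dm = 2\,\frac{\left(\frac{N_1+N_2-2}{2}\right)!}{\left(\frac{N_1-2}{2}\right)!\,\left(\frac{N_2-2}{2}\right)!}\;\frac{N_1^{N_1/2}\,N_2^{N_2/2}\,e^{N_1 z}}{\left(N_1 e^{2z}+N_2\right)^{\frac{N_1+N_2}{2}}}\,dz$$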

The transformation into logarithms has certain advantages
for interpolation in a table of z, but the direct use of the ratio of
the root mean squares, or of the ratio of two mean squares from
which the roots are derived, has much to recommend it. Various
of these forms of z have been tabulated. Fisher and Yates's
book of tables includes z and the variance, or mean square, ratio.
This latter has also been tabulated under the symbol F by
Snedecor, while Tedin has given the table of the root mean
square ratio. Table IV at the end of this book gives the Variance
Ratio. This will be found sufficient for all practical purposes.
Simple transformation will give any other forms in which it is
desirable to represent a z value.



In the calculation of z or the variance ratio, just as in the case
of t, estimated standard deviations and variances are used. The
probability distributions of z and the variance ratio are
independent of the two true standard deviations or variances of
which estimates are being used. Hence z and the variance ratio
afford exact tests of significance of observed data.

Each value of z, just like each value of c and t, corresponds
to some definite probability of finding at least as bad a fit, or, in
other words, at least as large a value of z as that actually obtained.
In order to use z for the purpose of finding this sampling error
probability it is necessary to know the number of degrees of
freedom appertaining to both numerator and denominator.
Any given level of probability is reached by smaller values of
z as both $N_1$ and $N_2$ increase, although these two numbers of
degrees of freedom do not have an equal effect on the relation
between z and its probability.

There is never any ambiguity as to which is the numerator
and which the denominator in the case of t, since the former must
always be a single difference. In the case of z, on the other
hand, this simple method of decision is not available and
ambiguity may arise. The rule which covers all cases is, however, quite
a simple one. The numerator is always the larger mean square
or root mean square. This usually, but not always, has the
smaller number of degrees of freedom. Confusion is sometimes
caused by a failure to realize that it is the magnitude of the
mean square or root mean square, and not that of its number
of degrees of freedom, which determines whether it will go into
the numerator or the denominator.
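The rule, and the relation between z and the variance ratio, may be put in a few lines. The sketch below uses two hypothetical mean squares with their degrees of freedom; the function name is illustrative.

```python
import math


def variance_ratio_and_z(ms_a, df_a, ms_b, df_b):
    """Place the larger mean square in the numerator; return ratio, z, N1 and N2."""
    if ms_a >= ms_b:
        (numerator, n1), (denominator, n2) = (ms_a, df_a), (ms_b, df_b)
    else:
        (numerator, n1), (denominator, n2) = (ms_b, df_b), (ms_a, df_a)
    ratio = numerator / denominator   # variance (mean square) ratio
    z = 0.5 * math.log(ratio)         # natural log of the ratio of root mean squares
    return ratio, z, n1, n2


# Hypothetical mean squares: 2.5 on 20 degrees of freedom and 10.0 on 4.
ratio, z, n1, n2 = variance_ratio_and_z(2.5, 20, 10.0, 4)
print(ratio, round(z, 4), n1, n2)
```

Here the larger mean square happens to carry the smaller number of degrees of freedom, but it is its magnitude alone that sends it to the numerator.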

The tabulation of z or of the variance ratio causes some
difficulty as it requires a three-dimensional table. The table of
normal deviates has only one dimension, viz. the probability.
The table of t has a second dimension, the number of degrees of
freedom of the denominator. The table of z has still another
dimension, the number of degrees of freedom of the numerator.
Now a three-dimensional table is impracticable and so it is
customary to give a series of two-dimensional tables in its place.
Each such table corresponds to a given probability level, the two
dimensions used in the table being $N_1$ and $N_2$, the two numbers
of degrees of freedom. The probability levels for which tables
of z and the variance ratio have been prepared are P = 0.2, 0.05,
0.01 and 0.001 (Table IV). It should always be remembered
that these are only slices from a three-dimensional whole.

The use of the variance ratio in analysis will be fully illustrated
in later chapters.




17. THE $\chi^2$ DISTRIBUTION

The relation of $\chi^2$ to the variance ratio is like the relation of
c to t. $\chi^2$ is the special case of the variance ratio where the
denominator is fixed by hypothesis, i.e. it is the ratio of an
observed variance, or, more correctly, sum of squares, to another
fixed by hypothesis. It might equally be said that $\chi^2$ is a
generalized $c^2$, just as the variance ratio is a generalized $t^2$. The
numerator of $\chi^2$ differs from that of $c^2$ by being based on any
number of differences, and so is characterized by a number of
degrees of freedom. The denominators of $\chi^2$ and $c^2$ are equivalent.

The $\chi^2$ table has two dimensions like the t table, one margin
being the probability and the other the number of degrees of
freedom to which the value corresponds. It cannot, however,
be over-emphasized that whereas the degrees of freedom of t
are those appertaining to the denominator, the degrees of freedom
of $\chi^2$ are those appertaining to the numerator. t is characterized
by the $N_2$ of z while $\chi^2$ is characterized by the $N_1$ value.

A $\chi^2$ is seldom obtained directly as the ratio of two variances
or sums of squares. It is usually found in the form for which it
was first calculated by Karl Pearson, viz. as a test of goodness
of fit of observed with expected frequencies. The formula for
the calculation of $\chi^2$ observed in this way is

$$\chi^2 = S\left[\frac{(a-mn)^2}{mn}\right]$$

where a is the observed class frequency, m the proportion expected
in that class, n the number of observations and S indicates
summation over all classes. The appropriate number of degrees
of freedom is found as the number of classes to which values
may be assigned arbitrarily.
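The formula translates directly into a short calculation. The sketch below assumes a hypothetical family of 100 individuals segregating 60 : 40 where a 1 : 1 ratio is expected; with two classes and a fixed total only one class frequency can be assigned arbitrarily, so there is one degree of freedom. The same deviation is also expressed as a normal deviate, using the standard deviation $\sqrt{pqk}$ appropriate to numbers of individuals, to show that $\chi^2$ for one degree of freedom is $c^2$.

```python
def chi_square(observed, proportions):
    """S[(a - mn)^2 / mn] summed over all classes."""
    n = sum(observed)
    return sum((a - m * n) ** 2 / (m * n) for a, m in zip(observed, proportions))


# Hypothetical segregation: 60 of one class and 40 of the other, 1:1 expected.
observed = [60, 40]
expected_proportions = [0.5, 0.5]

chi2 = chi_square(observed, expected_proportions)

# The same deviation as a normal deviate: d / sqrt(pqk) with p = q = 1/2, k = 100.
n = sum(observed)
c = (observed[0] - 0.5 * n) / (n * 0.5 * 0.5) ** 0.5

print(chi2, c, c ** 2)
```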

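The probability distribution of $\chi^2$ is

$$dm = \frac{1}{2^{N/2}\left(\frac{N-2}{2}\right)!}\,(\chi^2)^{\frac{N-2}{2}}\,e^{-\frac{\chi^2}{2}}\,d(\chi^2)$$

Its various uses will be illustrated by later examples. A table
of $\chi^2$ is found at the end of this book (Table III).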

18. THE INTERRELATIONS OF c, t, $\chi^2$ AND z AND THEIR USE IN
ANALYSIS

The four exact tests of significance, c, t, $\chi^2$ and z, were
developed by different mathematicians for widely different purposes
at different times. It is not then surprising that their
interrelations have tended to be obscured. They are all
ratios of two variances, or alternatively of two standard
deviations, though these ratios may be subjected to various simple
transformations. The most general form, and the last to be
calculated, is z. The other three are best understood as special
cases of z.

It has already been noted that a z table is three-dimensional,
the three dimensions being $N_1$, $N_2$ and P, the probability. Such
a table is shown diagrammatically in Fig. 4, where the two
numbers of degrees of freedom are marked along the sides of the
face of the figure. The probability is marked along the receding
dimension. t is a variance ratio whose numerator has always
one degree of freedom (i.e. $N_1 = 1$). Therefore the t table is
derived from the leftmost vertical side of the z table. $\chi^2$ is a
variance ratio whose denominator is fixed by hypothesis, which
is equivalent to its having $\infty$ degrees of freedom ($N_2 = \infty$).
Therefore the $\chi^2$ table is obtained from the under-surface of the
z table. c has a numerator with one degree of freedom and
a denominator fixed by hypothesis. So c is a special case of
both t and $\chi^2$. Actually it is the lower left receding corner of the
solid.

FIG. 4

The three-dimensional z table showing the interrelations of c, t, $\chi^2$ and z
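The same relations may be summarised symbolically. The block below is a sketch that assumes Fisher's definition of z as half the natural logarithm of the variance ratio $V_1/V_2$, the two variances having $N_1$ and $N_2$ degrees of freedom; the bracketed subscripts denoting degrees of freedom are a notational assumption made here.

```latex
\begin{aligned}
e^{2z} &= V_1 / V_2 && (N_1,\ N_2 \text{ general}) \\
t^2_{[N]} &= e^{2z} && (N_1 = 1,\ N_2 = N) \\
\chi^2_{[N]} / N &= e^{2z} && (N_1 = N,\ N_2 = \infty) \\
c^2 &= e^{2z} && (N_1 = 1,\ N_2 = \infty)
\end{aligned}
```

Read in this way, $c^2$ is at once a $t^2$ with infinite degrees of freedom and a $\chi^2$ with one degree of freedom.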

Thus the four probability distributions are simply related 
and the choice of one of them for use in a particular problem 
depends solely on the nature of the data. The whole of the 
analysis of data really consists of their reduction to a state in 
which the question at issue can be formulated in terms of the 
ratio of two mean squares, or correspondingly of two root mean 
squares. This will be better appreciated from the examples and 
discussion of later chapters. 









It is of course clear that z or its transformation, the variance 
ratio, could be used in all cases, and that t or $\chi^2$ could be used
to replace c wherever that quantity is relevant. The reason for 
the use of the various special cases of z is simply that they are 
tabulated more easily and fully than the general quantity. Thus 
the table of c given at the end of this book is somewhat fuller 
and distinctly less cumbersome than the tables of t and $\chi^2$ which
in their turn are fuller and less cumbersome than the table of z. 

The use of the c, t, $\chi^2$ and z distributions forms the last stage
of the statistical analysis of a set of data. The preceding opera¬ 
tions are concerned with reducing the data to a state in which 
the question at issue can be answered by means of a comparison 
between observed and expected deviations, in the way made 
possible by the test distributions. These preceding operations, 
however, usually involve more varied and heavier computation 
than does the actual test towards which the whole analysis is 
directed. In consequence the discussion of the preparatory 
calculations will be much longer and more complex than that 
of the test distributions themselves. It is, however, necessary 
always to bear in mind that, in general, these calculations exist 
only to reduce the data to the state necessary for the actual 
test. 

The form of the preparatory calculation, like that of the 
actual test distribution to be used, depends on the type of data 
for analysis. There are, broadly speaking, two such types, 
measurements and frequencies. The former generally involve 
the use of estimated variances and so the test is made with t or 
z distributions. These will be treated first in Chapters V-VII. 
Frequencies, on the other hand, can commonly be compared 
with variances fixed by hypothesis, and so require the use of 
c and $\chi^2$ tests. Their analysis is treated in Chapter XI with
the exception of one simple example in Chapter V. The joint 
analysis of two or more variates almost always involves estimated 
variances and so is handled by extensions of the methods of 
Chapters V-VII. Hence, though mathematically more complex 
than some of the frequency analyses, regression and correlation 
are discussed before frequencies in Chapters VIII-X. Some
problems of estimation are treated in connexion with the use of 
tests of significance, but the general discussion of estimation 
and its relation to significance tests is somewhat heavy and so 
is deferred to the last chapter. 

Before leaving the consideration of these four quantities it 
must be emphasized once again that they are exact in the sense 
that each of them is calculated from exactly known constituents, 
and involves no approximation. This, of course, presupposes the
correct use of the quantities. Thus a t which was mistakenly
used as a c would not be exact. The inexactitude here, however,
lies with the user and not with the t distribution.

The importance of the exact nature of these four distributions 
depends on the fact that in drawing conclusions from them no 
assumptions are made about those parameters of the population 
which are not given by hypothesis. Estimates of these para¬ 
meters are used and the distributions make the proper allowance 
for the sampling error of the estimates. Thus rigorous inferences
may be reached from observational data alone, without involving 
any assumptions about the parent populations. 



NOTE. In this chapter n! is used in a new way, viz. when n is not
an integer. In such cases n! is defined as the Eulerian integral

$$n! = \int_0^\infty x^n e^{-x}\,dx$$

which gives

$$\left(-\tfrac{1}{2}\right)! = \sqrt{\pi}$$

and so on.



CHAPTER V 


THE SIGNIFICANCE OF SINGLE OBSERVATIONS, 
SUMS, DIFFERENCES AND MEANS 

19. SINGLE OBSERVATIONS 

IN the last chapter we examined the principles on which tests 
of significance are based and discussed the distributions which 
are used in assessing significance. The application of these 
principles and distributions must be considered next. 

The normal deviate, c, may strictly speaking be used only 
when hypothesis fixes the variance of the distribution involved. 
Actually it may be used in place of t without serious error when¬ 
ever the number of degrees of freedom on which the variance is 
based is reasonably large, say 30 or more. Thus the ratio of an 
observed deviation to the standard deviation of the maize dis¬ 
tribution calculated in Section 10 could be treated as a normal 
deviate because the number of degrees of freedom is 529. 

It is the exception rather than the rule for the variance to 
be fixed by hypothesis, but cases are not uncommon, especially 
in genetical experiments. A genetical example will be chosen 
to illustrate the method of testing the significance of single 
observations and sums. 

Example 5. The gene P, p determines flower colour in 
Datura stramonium, PP and Pp plants having anthocyanin
pigmentation, while pp individuals are white. On self-pollinating 
heterozygous Pp plants a 3 : 1 ratio for coloured to white is 
expected, and Sirks obtained an actual segregation of 59 coloured 
to 14 white. Does this agree with expectation? 

The frequencies with which families of 73 individuals containing
0, 1, 2 . . . whites are expected are given by the expansion
of the binomial $(p+q)^k$, i.e. $(\tfrac{1}{4}+\tfrac{3}{4})^{73}$, where
$p=\tfrac{1}{4}$, $q=\tfrac{3}{4}$ and k=73.

Then the number of white plants expected, pk, is $\tfrac{1}{4} \times 73$ or
18.25. The number observed is 14, so that d, the deviation of
observation from expectation, is 4.25.

The variance of the distribution obtained when the binomial
expression is expanded may be found as V = pqk, which in the
present case is $\frac{3 \times 73}{16}$ or 13.6875. The standard deviation, or
standard error, $\sigma$, is obtained as the square root of the variance,
so being 3.6997. The normal deviate, c, is the ratio of d to $\sigma$
and so in the present case is $\frac{4.25}{3.6997}$ or 1.149. Table I shows
that a c of this magnitude is equalled or exceeded as a result of
random sampling in about 25% of cases. The deviation could
thus reasonably be ascribed to sampling error and the fit with
expectation passes as sufficiently good.
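The arithmetic of Example 5 can be retraced with a short Python sketch. The function name is hypothetical, and the probability is taken as the two-tailed area of the normal curve, computed from the complementary error function rather than read from Table I.

```python
import math


def normal_deviate(observed, k, p):
    """Normal deviate c for an observed count in a binomial family of size k."""
    expected = p * k                      # pk
    variance = p * (1 - p) * k            # pqk
    sigma = math.sqrt(variance)           # standard deviation
    d = abs(observed - expected)          # deviation from expectation
    c = d / sigma
    prob = math.erfc(c / math.sqrt(2))    # two-tailed normal probability
    return expected, variance, sigma, c, prob


# 14 whites observed in a family of 73, one quarter expected to be white.
expected, variance, sigma, c, prob = normal_deviate(14, 73, 0.25)
print(f"expected {expected}, variance {variance}, sigma {sigma:.4f}")
print(f"c = {c:.3f}, probability = {prob:.3f}")
```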

The use of the normal deviate as a test of significance in this 
case requires some comment. Strictly speaking, the method is 
applicable only to data which may be supposed to fall on 
a normal distribution. The binomial approaches normality as 
k increases and when, as in the present case, k=73, the error
introduced by the assumption of normality is not large. This 
error largely results from the fact that, while the binomial 
distribution is discontinuous in that the variate can only take 
certain values, in the present case 0, 1, 2, &c., the assumption 
of normality presupposes that variation is continuous. A correc¬ 
tion can be applied in order to reduce this error. Yates’s 
Correction for Continuity, as it is called, consists of reducing 
the deviation, d, by 0.5 before calculating the normal deviate.
Thus in the above example we should take d = 4.25 - 0.5 = 3.75,
which gives $c = \frac{3.75}{3.6997} = 1.014$, with a probability of 30% instead
of the 25% found when an uncorrected deviate is used. Actually
Yates’s correction is rather too drastic, and while the original 
calculation overestimates the significance of the result, the 
corrected result is an underestimate, though much nearer to the 
true value. The true value can be found by expanding the 
binomial distribution and summing the tails as shown in Section 6, 
but this is far too cumbersome a process to be undertaken as 
a regular measure and so the quick approximate normal deviate 
method is used instead. 
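The three calculations can be set side by side in Python. This is a sketch only: the exact probability is taken here as the sum of the binomial terms for all counts deviating from expectation at least as far as the observed count in either direction, which is an assumption about how the tails are to be summed.

```python
import math

k, p, observed = 73, 0.25, 14
expected = p * k
sigma = math.sqrt(p * (1 - p) * k)
d = abs(observed - expected)


def two_tailed(c):
    """Two-tailed probability of a normal deviate c."""
    return math.erfc(c / math.sqrt(2))


def binomial_term(x):
    """Probability of exactly x whites in a family of k."""
    return math.comb(k, x) * p ** x * (1 - p) ** (k - x)


p_uncorrected = two_tailed(d / sigma)
p_yates = two_tailed((d - 0.5) / sigma)        # Yates's correction for continuity
p_exact = sum(binomial_term(x) for x in range(k + 1)
              if abs(x - expected) >= d - 1e-12)

print(f"uncorrected {p_uncorrected:.3f}, corrected {p_yates:.3f}, exact {p_exact:.3f}")
```

Even for a family of only 73 the exact summation involves dozens of binomial terms, which is why it is impractical by hand.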

Another family of 126 plants was later raised by Sirks by 
self-pollinating Pp individuals in the same pedigree. On this 
second occasion he obtained a segregation of 103 coloured to 
23 white. The expectation on the basis of 3 : 1 is that 31.5
whites will be found in a family of this size. The deviation
observed is thus 8.5. The standard deviation for a family of
126 is $\sqrt{\tfrac{1}{4} \times \tfrac{3}{4} \times 126}$ or 4.8606, and so
$c = \frac{8.5}{4.8606} = 1.749$, with a probability
of about 8%. Even this result is reasonably ascribable to
chance variation when considered by itself. Yet both families 
show a marked deviation in the same direction, so encouraging 
the belief that, though neither is significant when alone, taken 
together the deviations might be too large to attribute to random 
sampling. A method of testing the joint significance is needed, 
and it is not difficult to see how this should be done. 





The two families are separated because they were obtained 
by pollinating different parent plants. These parents are, how¬ 
ever, of the same genetical constitution (Pp) on our hypothesis, 
and in consequence their progenies may be combined and treated 
as one large family. There is a special test of the legitimacy of 
such addition, but it must be reserved for later discussion 
(Section 48). 

Combining the two families gives a total of 199 plants consisting
of 162 coloured and 37 white, where the expectations are
149.25 and 49.75 respectively. The deviation of observation
from expectation is thus 12.75. The standard deviation is
$\sqrt{\tfrac{1}{4} \times \tfrac{3}{4} \times 199}$, i.e. 6.1084, giving
$c = \frac{12.75}{6.1084} = 2.087$ with a probability
of between 0.03 and 0.04. This must be judged to be at least
highly suspicious, and so in the absence of further information
the departure is taken to be too large to be reasonably attributed
to sampling error. There is evidence of a real discrepancy
between observation and hypothesis.
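A brief Python sketch repeats the test for each family and for the two combined. The helper name and the dictionary layout are illustrative assumptions, and the probabilities are two-tailed normal areas computed directly rather than taken from Table I.

```python
import math


def deviate(whites, total, p=0.25):
    """Return deviation, standard deviation, normal deviate and probability."""
    d = abs(whites - p * total)
    sigma = math.sqrt(p * (1 - p) * total)
    c = d / sigma
    return d, sigma, c, math.erfc(c / math.sqrt(2))


families = {"first": (14, 73), "second": (23, 126)}
# Combine by simple addition of the observed segregations.
families["combined"] = (14 + 23, 73 + 126)

results = {name: deviate(*counts) for name, counts in families.items()}
for name, (d, sigma, c, prob) in results.items():
    print(f"{name:9s} d = {d:6.2f}  sigma = {sigma:.4f}  c = {c:.3f}  P = {prob:.3f}")
```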

20. THE VARIANCE OF SUMS, DIFFERENCES AND MEANS

In the last example the two families were combined by simple 
addition of the observed segregations and the treatment of the 
resultant as a single large family. The observed deviation and 
the standard deviation were obtained by the application of the 
ordinary binomial formula. This question can, however, be 
approached in a different and more instructive way, viz. by 
attempting to find a method of combining the observed deviations 
and standard deviations directly rather than by adding the 
observed families. 

The deviations of the two individual families were 4.25 and
8.50 in the same direction. On adding, these give 12.75, which
was the deviation found from the combined data. The two
individual standard deviations were 3.6997 and 4.8606, while
that from the combined data was 6.1084. Simple addition clearly
does not apply in this case. But it does apply to the squares of
the standard deviations, i.e. to the variances. These were found
to be 13.6875 and 23.6250 for the separate families and 37.3125
for the combined data. The first two values give the third by
simple addition.
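The contrast between the two quantities is easily checked. The following Python lines are a minimal sketch using the family sizes of Example 5; the variable names are arbitrary.

```python
import math

p, q = 0.25, 0.75
k1, k2 = 73, 126

v1, v2 = p * q * k1, p * q * k2          # variances of the separate families
v_combined = p * q * (k1 + k2)           # variance of the combined family

print(v1, v2, v1 + v2, v_combined)       # the variances add
print(math.sqrt(v1) + math.sqrt(v2), math.sqrt(v_combined))  # the standard deviations do not
```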

It can easily be shown that these results hold true for all 
similar cases of the binomial expansion. Let $x_1$ and $x_2$ be the
numbers of white individuals observed in the two families which
contain in all $k_1$ and $k_2$ individuals respectively. The values of
p and q are clearly the same for both families. Then the numbers
of whites expected are $pk_1$ and $pk_2$ in the two families, giving
as the observed deviations $pk_1 - x_1$ and $pk_2 - x_2$. The two variances
are $\frac{3k_1}{16}$ and $\frac{3k_2}{16}$. Adding the two families, we find $x_1 + x_2$ whites
out of a total of $k_1 + k_2$. The deviation and variance of the combined
data are $p(k_1 + k_2) - (x_1 + x_2)$ and $\frac{3(k_1 + k_2)}{16}$ respectively. These
two values are clearly the sums of corresponding quantities
obtained from the individual families, since they may be written
as $(pk_1 - x_1) + (pk_2 - x_2)$ and $\frac{3k_1}{16} + \frac{3k_2}{16}$.

Thus the variance of the sum of two single observations,
$x_1$ and $x_2$, is the sum of the separate variances of $x_1$ and $x_2$ for
the binomial expansion. This rule is in fact of wider application,
being true of all independent observations and applying to the
sum of any number of observations. It may be written as

$$V_{(x_1 + x_2 + \dots + x_n)} = V_{x_1} + V_{x_2} + \dots + V_{x_n}$$

where $V_{x_1}$ indicates the variance of the distribution on which
the observation $x_1$ is presumed to fall.

It is important to notice that the rule applies to the sums of 
independent observations. A further analysis of the situation is 
of interest since it will help to clarify the mathematical meaning 
of this term ‘ independent ’. Let us consider a sample of size 
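n of each of a pair of measurements, x and y, which are both
normally distributed. Then the variance of the distribution of
x will be $V_x = \frac{1}{n-1}S(x-\bar{x})^2$ and that of y will be
$V_y = \frac{1}{n-1}S(y-\bar{y})^2$. Similarly the variance of their sum will be
$V_{(x+y)} = \frac{1}{n-1}S[(x+y)-(\overline{x+y})]^2$. It should be noted that

$$\overline{x+y} = \frac{S(x+y)}{n} = \frac{S(x)}{n} + \frac{S(y)}{n} = \bar{x} + \bar{y}$$

Hence

$$\frac{1}{n-1}S[x+y-(\bar{x}+\bar{y})]^2 = \frac{1}{n-1}S[(x-\bar{x})+(y-\bar{y})]^2$$

$$= \frac{1}{n-1}S(x-\bar{x})^2 + \frac{2}{n-1}S[(x-\bar{x})(y-\bar{y})] + \frac{1}{n-1}S(y-\bar{y})^2$$

or

$$V_{(x+y)} = V_x + \frac{2}{n-1}S[(x-\bar{x})(y-\bar{y})] + V_y$$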


Now a value of y which deviates by dy from the mean $\bar{y}$ is
as likely to occur as a corresponding value which deviates by
-dy. Furthermore, if the value of x is not determined by
that of y, the deviation dy is just as likely to accompany any value
of x as is -dy. So no matter how many pairs of observations
are classified on the basis of x, the mean of the y values in any
class determined by x will be $\bar{y}$ within the limits of sampling
error. Thus for any value of $(x-\bar{x})$ the mean value of $(y-\bar{y})$ will
be $(\bar{y}-\bar{y})$ or in other words 0, and so the value of
$\frac{2}{n-1}S[(x-\bar{x})(y-\bar{y})]$
will be 0 within the limits of sampling error. Hence if x and
y are distributed independently in the sense that the value of
y is not determined by the value of x or vice versa

$$V_{(x+y)} = V_x + V_y$$

If, on the other hand, the distributions are not independent
and a large value of x is preponderantly accompanied by a large
value of y, $(y-\bar{y})$ will be positive more often than not when
$(x-\bar{x})$ is positive, and negative more often than not when $(x-\bar{x})$
is negative. In other words, $\frac{2}{n-1}S[(x-\bar{x})(y-\bar{y})]$ will be positive.
When the situation is reversed and a large value of x is most
frequently accompanied by a small value of y, $\frac{2}{n-1}S[(x-\bar{x})(y-\bar{y})]$
will clearly be negative. In either case the magnitude of this
term will increase as the interdependence of x and y becomes
more pronounced.

So the sign and magnitude of this term afford a measure of the
closeness and nature of the relations between the two quantities
concerned. It is in fact the mathematical measure of dependence
and is very widely used as such. $\frac{1}{n-1}S[(x-\bar{x})(y-\bar{y})]$ is called the
covariance of x and y, being written $W_{xy}$. $S[(x-\bar{x})(y-\bar{y})]$ is known
as the sum of cross products of x and y.

Then, in general, $V_{(x+y)} = V_x + V_y + 2W_{xy}$, which reduces to
$V_{(x+y)} = V_x + V_y$ for the case of independence.

Similarly, $V_{(x-y)} = V_x + V_y - 2W_{xy}$, which again reduces to
$V_{(x-y)} = V_x + V_y$ when x and y are independent.
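These two identities hold exactly for any set of paired observations, as a short Python check shows. The five pairs of values are hypothetical and the function names are chosen for illustration only.

```python
def mean(values):
    return sum(values) / len(values)


def cross_products(x, y):
    """Sum of cross products of deviations, S[(x - mean x)(y - mean y)]."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y))


def variance(x):
    return cross_products(x, x) / (len(x) - 1)


def covariance(x, y):
    return cross_products(x, y) / (len(x) - 1)


# Hypothetical paired measurements.
x = [2.0, 4.0, 5.0, 7.0, 9.0]
y = [1.0, 3.0, 2.0, 6.0, 8.0]

sums = [a + b for a, b in zip(x, y)]
diffs = [a - b for a, b in zip(x, y)]

print(variance(sums), variance(x) + variance(y) + 2 * covariance(x, y))
print(variance(diffs), variance(x) + variance(y) - 2 * covariance(x, y))
```

Because x and y rise together in these data the covariance is positive, so the variance of the sum exceeds, and that of the difference falls short of, the sum of the two separate variances.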

One further special case is worthy of note. Suppose we
wish to find the variance of 2x by this method.

$$V_{2x} = V_{(x+x)} = V_x + V_x + 2W_{xx}$$

But $S[(x-\bar{x})(x-\bar{x})] = S(x-\bar{x})^2$, and so $W_{xx} = V_x$. Hence

$$V_{2x} = V_x + V_x + 2V_x = 4V_x$$

which is the rule that has already been found empirically in
Section 13.

It is now possible to find the formula for the variance of 
a mean. Consider a sample consisting of n observations of the 
variate x. These n observations are independent in that they 
are made on different individuals or at different times, as for
example the 530 maize observations were independent inasmuch
as they were taken from different plants. Each observation will
be subject to a variance $V_x$ so that from the argument developed
above their sum will have a variance of $nV_x$. The mean is
found by dividing the sum of the observations by n and we have
seen that $V_{nx} = n^2V_x$. Similarly $V_{S(x)} = n^2V_{\bar{x}}$ since $S(x) = n\bar{x}$, or,
by a rearrangement of the various components,

$$V_{\bar{x}} = \frac{V_{S(x)}}{n^2} = \frac{nV_x}{n^2} = \frac{V_x}{n}$$

Thus the variance of the distribution of the mean of n observations
is the variance of a single observation divided by n.
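The rule can be verified exactly by enumeration. The Python sketch below takes a small hypothetical population, lists every equally likely sample of n independent draws, and compares the variance of the resulting means with that of a single observation. The variances here are true variances of completely known distributions, so the divisor is the number of values and not the degrees of freedom.

```python
from itertools import product


def true_variance(values):
    """Variance of a completely enumerated distribution (divisor = count)."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values) / len(values)


population = [1.0, 2.0, 3.0, 4.0, 6.0]   # hypothetical, equally likely values
n = 3                                     # size of each sample

# Every ordered sample of n independent draws is equally probable.
means = [sum(sample) / n for sample in product(population, repeat=n)]

v_single = true_variance(population)
v_mean = true_variance(means)
print(v_single, v_mean, v_single / n)
```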

21. THE SIGNIFICANCE OF MEANS AND DIFFERENCES OF MEANS 

The method of testing the significance of single observations 
and of the sum of two observations has been illustrated in 
Example 5. The next example will give the procedure appro¬ 
priate to testing means and differences of means. 

Example 6. Table 8 gives the yields, over a certain period of
time, of ten plants of each of two tomato varieties (Crane and 
Mather’s unpublished data). The results are expressed in kilo¬ 
grams of ripe fruit. Now these two strains, together with others, 
were grown in a lean-to greenhouse in which the lighting conditions 
were far from uniform. So it was deemed necessary to design 
the experiment in such a way that each variety had plants in 
corresponding positions relative to the light, paths and other 
features. Thus each variety had one plant in position 1,
another in position 2, and so on.

First take the individual yields of the two varieties. We
might, for example, ask whether either could be considered to
be giving an average yield of over 1 Kg. per plant during the
period over which records were taken. To answer this question
it is necessary, first, to calculate the mean yield of each variety,
and next the standard deviation of the distribution which this
mean supposedly follows. The necessary calculations are set out
in Table 8. In variety A the ten plants together gave 13.253 Kg.
of ripe fruit ($S(x)$), which corresponds to a mean yield of 1.3253
per plant ($\bar{x}$). The variance of the yield of single plants must
be estimated from the data, as the hypothesis under consideration,
viz. that the mean yield per plant is 1 Kg., supplies no expected
variance. To estimate the variance we find that the sum
of squares of yields ($S(x^2)$) is 18.521355. From this must be
subtracted a correction term to reduce it to the sum of squares
of deviations from the mean (Section 10). This correction is
$\frac{S^2(x)}{n} = \frac{13.253^2}{10} = 17.564201$, leaving
$S(x-\bar{x})^2 = 18.521355 - 17.564201 = 0.957154$. There are 10-1 comparisons involved in
the estimate of the sum of squares, as the mean has already
been found from the data, and so

$$V_x = \frac{S(x-\bar{x})^2}{n-1} = \frac{0.957154}{9} = 0.106350$$

But the variance of the distribution of the mean of n observations
is $\frac{1}{n}$th the variance of the distribution of single observations,
and so

$$V_{\bar{x}} = \frac{V_x}{n} = \frac{0.106350}{10} = 0.010635$$

Taking the square root, we find that the standard deviation,
$s_{\bar{x}} = 0.103126$.


TABLE 8

The Yields of Two Varieties of Tomato (Crane and Mather)

                    Yields of varieties
Position               A             B            A-B

   1                 1.375         1.033         0.342
   2                 1.407         1.217         0.190
   3                 1.068         0.984         0.084
   4                 1.752         1.615         0.137
   5                 1.773         1.693         0.080
   6                 1.201         0.673         0.528
   7                 0.779         0.840        -0.061
   8                 1.042         0.842         0.200
   9                 1.223         1.253        -0.030
  10                 1.633         1.217         0.416

S(x)                13.253        11.367         1.886
x̄                    1.3253        1.1367        0.1886
S(x²)               18.521355     13.909499      0.681750
S²(x)/n             17.564201     12.920869      0.355700
S(x-x̄)²              0.957154      0.988630      0.326050
V_x                  0.106350      0.109848      0.036228
V_x̄                  0.010635      0.010985
s_x̄                  0.103126      0.104811

                        A and B                   A-B

d̄                       0.1886                   0.1886
V_d̄                     0.021620                 0.003623
s_d̄                     0.147036                 0.060189
t                       1.283                    3.133
N                       18                       9
P                       0.3-0.2                  0.02-0.01



Thus on our hypothesis we expect that the yield of the ten
plants will be distributed about a mean of 1 Kg. with a standard
error of 0.103126 Kg. It is assumed that distribution is normal.
The distribution of the mean tends to normality even when that
of the single observation departs widely from this form. In
any case, as pointed out in Section 11, the assumption of normality
leads to no serious error.

The observed mean shows a deviation of 0.3253 from that
expected, and so $t_{[9]} = \frac{d}{s} = \frac{0.3253}{0.103126} = 3.154$. The standard error has
been estimated from the data and so the ratio of deviation to
standard deviation is a t, not a c (Section 15). This t will have
9 degrees of freedom, the standard deviation having been
estimated on the basis of 9 independent comparisons. The
number of degrees of freedom is denoted by the subscript
9 attached to the t above. This device, though not in standard
use, is very convenient. On consulting the table of t it is found
that a value of 3.154 when based on 9 degrees of freedom has a
probability slightly greater than 1% of being equalled or exceeded
as a result of sampling error. This is very small, so the variety
A must be supposed to have a yield greater than 1 Kg. per plant.
A similar calculation for variety B gives $\bar{x} = 1.1367$, $V_x = 0.109848$,
$V_{\bar{x}} = 0.010985$, $s_{\bar{x}} = 0.104811$, $d = 0.1367$ and
$t_{[9]} = \frac{0.1367}{0.104811} = 1.304$ with
a probability of between 0.2 and 0.3. Thus though the mean of
variety A differs significantly from 1 Kg. per plant, that of
variety B does not do so.
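The calculation for the two varieties can be retraced in Python. The sketch below takes the yields as in Table 8, each A yield being the B yield plus the tabulated difference; the function name is hypothetical. The probability is left to be read from the table of t, since no t distribution is assumed to be available in code.

```python
import math

yields_a = [1.375, 1.407, 1.068, 1.752, 1.773, 1.201, 0.779, 1.042, 1.223, 1.633]
yields_b = [1.033, 1.217, 0.984, 1.615, 1.693, 0.673, 0.840, 0.842, 1.253, 1.217]


def t_for_mean(values, hypothetical_mean):
    """t and its degrees of freedom for a mean tested against a fixed value."""
    n = len(values)
    total = sum(values)
    # Sum of squares less the correction term S^2(x)/n.
    ss = sum(v * v for v in values) - total ** 2 / n
    variance = ss / (n - 1)                 # variance of single observations
    se_mean = math.sqrt(variance / n)       # standard error of the mean
    t = (total / n - hypothetical_mean) / se_mean
    return t, n - 1


t_a, df_a = t_for_mean(yields_a, 1.0)
t_b, df_b = t_for_mean(yields_b, 1.0)
print(f"variety A: t = {t_a:.3f} with {df_a} degrees of freedom")
print(f"variety B: t = {t_b:.3f} with {df_b} degrees of freedom")
```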

The next question which may be considered is whether there
exists a real difference in yield between the two varieties. The
two means have already been found and the difference between
them ($\bar{d}$) is 0.1886 Kg. The variance of this difference will be
the sum of the variances of the two separate means since the
measurements of the two varieties might be supposed to be
independent. Hence $V_{\bar{d}} = V_{\bar{x}_A} + V_{\bar{x}_B} = 0.010635 + 0.010985 = 0.021620$.
Then $s_{\bar{d}} = \sqrt{V_{\bar{d}}} = 0.147036$, and as $\bar{d} = 0.1886$,
$t = \frac{\bar{d}}{s_{\bar{d}}} = \frac{0.1886}{0.147036} = 1.283$. It
will be observed that the variance of the average difference is
compounded from the variance of the average yield of variety
A and that of variety B. Each of these two variances was based
on 9 degrees of freedom, and so the variance of the difference is
derived ultimately from 18 comparisons. Hence the t takes
18 degrees of freedom. A t of value 1.283 for 18 degrees of
freedom has a probability of between 0.2 and 0.3. We must
therefore judge that the difference between the yields of the two
varieties may be accounted for by sampling error and that there
is no reason to suppose that a real difference in productivity
exists.
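A Python sketch of this first test follows, with the yields taken as in Table 8. The helper name is an assumption made for illustration; the essential step is that the variances of the two means are simply added.

```python
import math

yields_a = [1.375, 1.407, 1.068, 1.752, 1.773, 1.201, 0.779, 1.042, 1.223, 1.633]
yields_b = [1.033, 1.217, 0.984, 1.615, 1.693, 0.673, 0.840, 0.842, 1.253, 1.217]


def mean_and_its_variance(values):
    """Mean of the sample and the estimated variance of that mean."""
    n = len(values)
    m = sum(values) / n
    variance = sum((v - m) ** 2 for v in values) / (n - 1)
    return m, variance / n


mean_a, v_mean_a = mean_and_its_variance(yields_a)
mean_b, v_mean_b = mean_and_its_variance(yields_b)

d = mean_a - mean_b
v_d = v_mean_a + v_mean_b          # valid only if the two sets are independent
t = d / math.sqrt(v_d)
df = (len(yields_a) - 1) + (len(yields_b) - 1)
print(f"d = {d:.4f}, variance = {v_d:.6f}, t = {t:.3f} with {df} degrees of freedom")
```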


The test of significance between the mean yields may, how¬ 
ever, be attempted in another way. In Table 8 the yields of 
plants of the varieties growing in similar positions in the green¬ 
house are listed side by side. Now if there is no real difference 
between the two varieties the mean difference between plants 
growing in corresponding positions should be zero, within the 
limits of sampling error. This hypothesis can be tested. 

The fourth column of Table 8 gives the ten differences and
shows the calculation of the test of significance. The sum of
the differences, taking sign into account, is 1.886, and so the
mean difference is 0.1886 Kg. The sum of squares of the
differences is 0.681750, from which must be subtracted the
correction term $\frac{1.886^2}{10}$ or 0.355700, leaving 0.326050 as the
sum of squares of deviations from the mean. Then the variance
of the distribution of differences ($V_d$) is estimated as $\frac{0.326050}{9}$,
i.e. 0.036228, as there are 9 degrees of freedom between the
10 values, after the mean has been calculated. The variance of
the distribution on which the mean difference ($\bar{d}$) falls is
$V_{\bar{d}} = \frac{V_d}{n} = \frac{0.036228}{10} = 0.003623$. Then
$s_{\bar{d}} = \sqrt{0.003623} = 0.060189$ and
$t_{[9]} = \frac{0.1886}{0.060189} = 3.133$. The probability of such a t is less than 0.02
and the mean difference cannot be considered to be 0 within the
limits of sampling error. This test, unlike the previous one,
tells us that there is a real difference in productivity between
the two varieties. Why do the tests disagree?
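The second test works on the ten differences alone, as the following Python sketch shows. The yields are taken as in Table 8 and the variable names are chosen for illustration.

```python
import math

yields_a = [1.375, 1.407, 1.068, 1.752, 1.773, 1.201, 0.779, 1.042, 1.223, 1.633]
yields_b = [1.033, 1.217, 0.984, 1.615, 1.693, 0.673, 0.840, 0.842, 1.253, 1.217]

# Differences between plants occupying the same position.
differences = [a - b for a, b in zip(yields_a, yields_b)]
n = len(differences)

total = sum(differences)
mean_d = total / n
ss = sum(d * d for d in differences) - total ** 2 / n   # corrected sum of squares
v_d = ss / (n - 1)                                       # variance of single differences
s_mean_d = math.sqrt(v_d / n)                            # standard error of the mean difference
t = mean_d / s_mean_d
df = n - 1
print(f"mean difference = {mean_d:.4f}, t = {t:.3f} with {df} degrees of freedom")
```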

The t's used in the two tests differed in two respects. First 
of all, though having the same numerator, 0.1886, they had
different denominators. The standard deviation, or alternatively 
the variance, of the mean difference was different in the two 
tests. In the second place it will be remembered that in the 
first test t had 18 degrees of freedom, while in the second test 
t had only 9 degrees of freedom. The variance in the first test 
was based on 18 comparisons between 20 individuals, while in 
the second test only 9 of these comparisons were used. Thus 
in the second test 9 comparisons were rejected as not contributing 
relevant information, and the remaining 9 taken as giving the 
appropriate measure of sampling variance. 

It is not difficult to see how this came about. In the first
test no account was taken of the position of each plant in the
greenhouse, while in the later test comparisons were made only
between plants in like positions. Thus in the second test 9 degrees
of freedom were found to be concerned with differences in position
and as such were rejected. That this really accounts for the
difference between the two tests can be shown by taking differences
between plants in unlike positions. In Table 9 the 10 yields
of variety B are rearranged in a random order, while those of
variety A are as before. The technique of the second test is
then applied to the data of Table 9.

TABLE 9

Rearranged Tomato Data

  Yields of varieties
     A          B          A-B

   1.375      1.217       0.158
   1.407      1.217       0.190
   1.068      0.842       0.226
   1.752      0.840       0.912
   1.773      1.693       0.080
   1.201      1.253      -0.052
   0.779      0.984      -0.205
   1.042      1.615      -0.573
   1.223      0.673       0.550
   1.633      1.033       0.600

S(d)        = 1.886
d̄           = 0.1886
S(d²)       = 1.985842
S²(d)/n     = 0.355700
S(d-d̄)²     = 1.630142
V_d         = 0.181127
V_d̄         = 0.018113
s_d̄         = 0.134585
t           = 1.401
N           = 9
P           = 0.2-0.1

The mean difference is
again 0.1886, but the variance of the distribution is now 0.181127,
giving as the standard deviation of the mean difference $\sqrt{0.018113}$,
i.e. 0.13459. Then $t_{[9]} = \frac{0.1886}{0.13459} = 1.401$, which has a probability of
between 0.2 and 0.1. The difference is no longer significant
when the positions of the plants in the greenhouse are rearranged.
In other words, the wrong 9 degrees of freedom have been chosen
for use in the test. If position had played no part in determining
the yields, i.e. if conditions had been constant all over the greenhouse,
then this test would have given the same result as the
other two, which would, of course, have agreed with each other.
It is solely the effect of position which makes it profitable to
distinguish between the various degrees of freedom. This may
be expressed in another way, viz. that, since position affects the
yields of the plants, the observations of the varieties are not
wholly independent and the calculation of the variance of the
mean difference by summing the variances of the two means is
not legitimate. When the dependence due to position effect has
been removed, as in the second calculation, the ten observations
used are really independent and the test of significance is valid.
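The part played by the covariance can be made explicit. The Python sketch below, with hypothetical function names and the yields taken as in Tables 8 and 9, computes the covariance of the A and B yields under each pairing and forms the variance of the differences as the sum of the two variances less twice the covariance.

```python
import math

yields_a = [1.375, 1.407, 1.068, 1.752, 1.773, 1.201, 0.779, 1.042, 1.223, 1.633]
by_position_b = [1.033, 1.217, 0.984, 1.615, 1.693, 0.673, 0.840, 0.842, 1.253, 1.217]
rearranged_b = [1.217, 1.217, 0.842, 0.840, 1.693, 1.253, 0.984, 1.615, 0.673, 1.033]


def mean(values):
    return sum(values) / len(values)


def covariance(x, y):
    """Sum of cross products divided by n - 1; covariance(x, x) is the variance."""
    mx, my = mean(x), mean(y)
    return sum((a - mx) * (b - my) for a, b in zip(x, y)) / (len(x) - 1)


def difference_test(a, b):
    """Covariance, variance of differences and t for a given pairing."""
    n = len(a)
    w_ab = covariance(a, b)
    v_d = covariance(a, a) + covariance(b, b) - 2 * w_ab
    t = (mean(a) - mean(b)) / math.sqrt(v_d / n)
    return w_ab, v_d, t


w_pos, v_pos, t_pos = difference_test(yields_a, by_position_b)
w_rand, v_rand, t_rand = difference_test(yields_a, rearranged_b)
print(f"paired by position: W = {w_pos:.6f}, V_d = {v_pos:.6f}, t = {t_pos:.3f}")
print(f"rearranged:         W = {w_rand:.6f}, V_d = {v_rand:.6f}, t = {t_rand:.3f}")
```

With plants paired by position the covariance is large and positive, so most of the variation common to both varieties is removed from the variance of the differences; after rearrangement the covariance is much smaller and the variance of the differences approaches the sum of the two separate variances.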

This result clearly implies that degrees of freedom can have 
special individual properties and that, on the basis of these 
individual properties, some may be rejected from use and others 
chosen as appropriate to the test in hand. This isolation of 
degrees of freedom is of immense importance in statistical analysis 
and must next be examined in greater detail. 

REFERENCES 

SIRKS, M. J. 1929. Mendelian factors in Datura, III. Genetica, 11,
257-66.



CHAPTER VI 


DEGREES OF FREEDOM AND THE ANALYSIS 

OF VARIANCE 

22. THE INDIVIDUALITY OF DEGREES OF FREEDOM

IN discussing degrees of freedom we may start with the simplest 
case, viz. that of two observations. Assuming that the mean and 
variance are not fixed by hypothesis, there will be one comparison 
left for the purpose of estimating the variance once the mean has 
been calculated. Let the two observed values be $a_1$ and $a_2$
respectively. Then the mean is clearly $\tfrac{1}{2}(a_1+a_2)$. The sum of
squares of deviations from the mean may be found from the
formula $S(x^2) - \frac{S^2(x)}{n}$, which applied to the present case gives

$$S.S. = a_1^2 + a_2^2 - \tfrac{1}{2}(a_1+a_2)^2 = \tfrac{1}{2}(a_1^2 - 2a_1a_2 + a_2^2) = \tfrac{1}{2}(a_1-a_2)^2$$

where S.S. stands for the 'sum of squares'. Thus the sum of
squares corresponding to the single degree of freedom is based on
the rather obvious comparison afforded by the difference between
$a_1$ and $a_2$.

The next simplest case is that of three observations, $a_1$, $a_2$ and
$a_3$. The mean of the three is $\tfrac{1}{3}(a_1+a_2+a_3)$ and their sum of squares
of deviations from the mean is

$$a_1^2 + a_2^2 + a_3^2 - \tfrac{1}{3}(a_1+a_2+a_3)^2$$

Now this sum of squares is based on two comparisons, and it
has been shown already that the comparison between $a_1$ and $a_2$
corresponds to a sum of squares of $\tfrac{1}{2}(a_1-a_2)^2$. Hence the remaining
comparison is

$$a_1^2 + a_2^2 + a_3^2 - \tfrac{1}{3}(a_1+a_2+a_3)^2 - \tfrac{1}{2}(a_1-a_2)^2$$

$$= \tfrac{1}{6}(4a_1^2 + 4a_2^2 + 4a_3^2 - 4a_1a_2 - 4a_1a_3 - 4a_2a_3 - 3a_1^2 - 3a_2^2 + 6a_1a_2)$$

$$= \tfrac{1}{6}(a_1^2 + a_2^2 + 4a_3^2 + 2a_1a_2 - 4a_1a_3 - 4a_2a_3) = \tfrac{1}{6}(a_1+a_2-2a_3)^2$$

This is clearly a simple comparison between the third observation
and the mean of the other two. Thus the two degrees of
freedom have quite distinct meanings. One is concerned with
the comparison of $a_1$ and $a_2$. The other is concerned with the
comparison of $a_3$ and the other two taken jointly. It should,
however, be noted that other possibilities exist for separating the
2 degrees of freedom, e.g. the first comparison could be made
between $a_1$ and $a_3$ or between $a_2$ and $a_3$, instead of between
$a_1$ and $a_2$. The three ways of partitioning the sum of squares
implied by these three choices of the first comparison are

(i) $\tfrac{1}{2}(a_1-a_2)^2$ and $\tfrac{1}{6}(a_1+a_2-2a_3)^2$

(ii) $\tfrac{1}{2}(a_1-a_3)^2$ and $\tfrac{1}{6}(a_1-2a_2+a_3)^2$

(iii) $\tfrac{1}{2}(a_2-a_3)^2$ and $\tfrac{1}{6}(-2a_1+a_2+a_3)^2$
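That each of the three partitions accounts for the whole sum of squares can be checked numerically. The Python sketch below uses three hypothetical observations; any other values would serve equally well.

```python
def total_sum_of_squares(values):
    """Sum of squares of deviations from the mean."""
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)


a1, a2, a3 = 3.0, 7.0, 11.5      # hypothetical observations
total = total_sum_of_squares([a1, a2, a3])

# Each pair holds the sums of squares for the two single degrees of freedom.
partitions = {
    "(i)": ((a1 - a2) ** 2 / 2, (a1 + a2 - 2 * a3) ** 2 / 6),
    "(ii)": ((a1 - a3) ** 2 / 2, (a1 - 2 * a2 + a3) ** 2 / 6),
    "(iii)": ((a2 - a3) ** 2 / 2, (-2 * a1 + a2 + a3) ** 2 / 6),
}

for label, (first, second) in partitions.items():
    print(f"{label:6s} {first:8.4f} + {second:8.4f} = {first + second:8.4f}  (total {total:.4f})")
```

The individual items differ from one partition to another, but their sum is always the same.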


A great number of further less obvious and generally less useful
partitions could also be devised.

Cases of four observations offer a still greater variety of
useful partitions. The mean is $\tfrac{1}{4}(a_1+a_2+a_3+a_4)$ and the total
sum of squares

$$a_1^2 + a_2^2 + a_3^2 + a_4^2 - \tfrac{1}{4}(a_1+a_2+a_3+a_4)^2$$

$$= \tfrac{1}{4}(3a_1^2 + 3a_2^2 + 3a_3^2 + 3a_4^2 - 2a_1a_2 - 2a_1a_3 - 2a_1a_4 - 2a_2a_3 - 2a_2a_4 - 2a_3a_4)$$

Continuing along the line of the previous analyses the first
two comparisons, between $a_1$ and $a_2$, and between $a_1$, $a_2$ and $a_3$,
have a joint sum of squares of $\tfrac{2}{3}(a_1^2 + a_2^2 + a_3^2 - a_1a_2 - a_1a_3 - a_2a_3)$,
leaving as the sum of squares for the third comparison

$$\tfrac{1}{12}(9a_1^2 + 9a_2^2 + 9a_3^2 + 9a_4^2 - 6a_1a_2 - 6a_1a_3 - 6a_1a_4 - 6a_2a_3 - 6a_2a_4 - 6a_3a_4 - 8a_1^2 - 8a_2^2 - 8a_3^2 + 8a_1a_2 + 8a_1a_3 + 8a_2a_3)$$

$$= \tfrac{1}{12}(a_1^2 + a_2^2 + a_3^2 + 9a_4^2 + 2a_1a_2 + 2a_1a_3 - 6a_1a_4 + 2a_2a_3 - 6a_2a_4 - 6a_3a_4)$$

$$= \tfrac{1}{12}(a_1+a_2+a_3-3a_4)^2$$

This is a comparison between $a_4$ and the mean of the first three
observations.

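Just as there were three possible partitions of the 2 degrees
of freedom from three observations, there are twelve possibilities
of this type with four observations, viz.:

(i) $\tfrac{1}{2}(a_1-a_2)^2$, $\tfrac{1}{6}(a_1+a_2-2a_3)^2$, $\tfrac{1}{12}(a_1+a_2+a_3-3a_4)^2$

(ii) $\tfrac{1}{2}(a_1-a_2)^2$, $\tfrac{1}{6}(a_1+a_2-2a_4)^2$, $\tfrac{1}{12}(a_1+a_2-3a_3+a_4)^2$

(iii) $\tfrac{1}{2}(a_1-a_3)^2$, $\tfrac{1}{6}(a_1-2a_2+a_3)^2$, $\tfrac{1}{12}(a_1+a_2+a_3-3a_4)^2$

(iv) $\tfrac{1}{2}(a_1-a_3)^2$, $\tfrac{1}{6}(a_1+a_3-2a_4)^2$, $\tfrac{1}{12}(a_1-3a_2+a_3+a_4)^2$

(v) $\tfrac{1}{2}(a_1-a_4)^2$, $\tfrac{1}{6}(a_1-2a_2+a_4)^2$, $\tfrac{1}{12}(a_1+a_2-3a_3+a_4)^2$

(vi) $\tfrac{1}{2}(a_1-a_4)^2$, $\tfrac{1}{6}(a_1-2a_3+a_4)^2$, $\tfrac{1}{12}(a_1-3a_2+a_3+a_4)^2$

(vii) $\tfrac{1}{2}(a_2-a_3)^2$, $\tfrac{1}{6}(-2a_1+a_2+a_3)^2$, $\tfrac{1}{12}(a_1+a_2+a_3-3a_4)^2$

(viii) $\tfrac{1}{2}(a_2-a_3)^2$, $\tfrac{1}{6}(a_2+a_3-2a_4)^2$, $\tfrac{1}{12}(-3a_1+a_2+a_3+a_4)^2$

(ix) $\tfrac{1}{2}(a_2-a_4)^2$, $\tfrac{1}{6}(-2a_1+a_2+a_4)^2$, $\tfrac{1}{12}(a_1+a_2-3a_3+a_4)^2$

(x) $\tfrac{1}{2}(a_2-a_4)^2$, $\tfrac{1}{6}(a_2-2a_3+a_4)^2$, $\tfrac{1}{12}(-3a_1+a_2+a_3+a_4)^2$

(xi) $\tfrac{1}{2}(a_3-a_4)^2$, $\tfrac{1}{6}(-2a_1+a_3+a_4)^2$, $\tfrac{1}{12}(a_1-3a_2+a_3+a_4)^2$

(xii) $\tfrac{1}{2}(a_3-a_4)^2$, $\tfrac{1}{6}(-2a_2+a_3+a_4)^2$, $\tfrac{1}{12}(-3a_1+a_2+a_3+a_4)^2$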

But these do not exhaust the simple ways of partitioning the four
observations. When the first comparison is of $a_1$ with $a_2$, the
second degree of freedom need not be calculated as $\tfrac{1}{6}(a_1+a_2-2a_3)^2$,
but might rest instead on the comparison $\tfrac{1}{2}(a_3-a_4)^2$. The third
degree of freedom will then have as its sum of squares

$$\tfrac{1}{12}(9a_1^2+9a_2^2+9a_3^2+9a_4^2-6a_1a_2-6a_1a_3-6a_1a_4-6a_2a_3-6a_2a_4-6a_3a_4)$$

$$\qquad-\tfrac{1}{2}(a_1^2+a_2^2-2a_1a_2)-\tfrac{1}{2}(a_3^2+a_4^2-2a_3a_4)$$

which reduces to

$$\tfrac{1}{12}(3a_1^2+3a_2^2+3a_3^2+3a_4^2+6a_1a_2-6a_1a_3-6a_1a_4-6a_2a_3-6a_2a_4+6a_3a_4)$$

$$=\tfrac{1}{4}(a_1+a_2-a_3-a_4)^2$$

This is the comparison between the sums, or means, of the first
two observations, themselves compared in the first degree of
freedom, and of the last two observations, compared in the
second degree of freedom. There are clearly three partitions of
this kind:

(i) $\tfrac{1}{2}(a_1-a_2)^2$, $\tfrac{1}{2}(a_3-a_4)^2$, $\tfrac{1}{4}(a_1+a_2-a_3-a_4)^2$

(ii) $\tfrac{1}{2}(a_1-a_3)^2$, $\tfrac{1}{2}(a_2-a_4)^2$, $\tfrac{1}{4}(a_1-a_2+a_3-a_4)^2$

(iii) $\tfrac{1}{2}(a_1-a_4)^2$, $\tfrac{1}{2}(a_2-a_3)^2$, $\tfrac{1}{4}(a_1-a_2-a_3+a_4)^2$

There is still another valuable partition in which all the
3 degrees of freedom rest on comparisons involving the four
observations. In this case the sums of squares are

$$\tfrac{1}{4}(a_1+a_2-a_3-a_4)^2,\quad \tfrac{1}{4}(a_1-a_2+a_3-a_4)^2,\quad \tfrac{1}{4}(a_1-a_2-a_3+a_4)^2$$

That these expressions together equal the total sum of squares
may be shown by expansion and addition. Though this is not
perhaps such an obvious method of partition it is, as will be
seen in later sections, the one most used in analysis.
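
The statement that these components add up to the total sum of squares can also be checked numerically. The following is a minimal sketch only: the four observations are arbitrary illustrative numbers, not data from the text, and the helper name `total_ss` is hypothetical.

```python
def total_ss(a):
    """Total sum of squares about the mean: S(a^2) - S(a)^2 / n."""
    n = len(a)
    return sum(x * x for x in a) - sum(a) ** 2 / n

# Four arbitrary illustrative observations (an assumption, not from the text).
a1, a2, a3, a4 = 3.0, 7.0, 4.0, 10.0
total = total_ss([a1, a2, a3, a4])

# Partition in which every comparison involves all four observations.
symmetric = [
    (a1 + a2 - a3 - a4) ** 2 / 4,
    (a1 - a2 + a3 - a4) ** 2 / 4,
    (a1 - a2 - a3 + a4) ** 2 / 4,
]

# Successive partition: a1 with a2, then a3 with their mean, then a4 with the rest.
successive = [
    (a1 - a2) ** 2 / 2,
    (a1 + a2 - 2 * a3) ** 2 / 6,
    (a1 + a2 + a3 - 3 * a4) ** 2 / 12,
]

print("total sum of squares:", total)
print("symmetric partition :", symmetric, "sum =", sum(symmetric))
print("successive partition:", successive, "sum =", sum(successive))
```

Both lists of components sum to the same total, although the individual components differ between the two partitions.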


23. THE PRINCIPLES OF PARTITION 


The components into which a sum of squares can be divided 
are very simple. Indeed, it is not difficult to write down a set 
of them for the partition of any given sum of squares. They 
have, however, special properties which must be ascertained 
before such rapid partition is, in general, possible. 

It will be observed that each component has two parts, the
initial fraction and the squared portion, which determines the
comparison on which the item is based. Thus for the comparison
between $a_1$ and $a_2$ the sum of squares is $\tfrac{1}{2}(a_1-a_2)^2$, the initial
fraction being $\tfrac{1}{2}$ and the squared portion resting on the comparison
$(a_1-a_2)$.

Now the comparisons are always linear functions of $a_1$, $a_2$, $a_3$,
&c., that is to say, they always involve $a_1$ rather than $a_1^2$ or
$a_1^3$, &c. So we may write down a general form to include all
sets of comparisons:

$$x_1=k_{11}a_1+k_{12}a_2+k_{13}a_3+\dots+k_{1n}a_n$$

$$x_2=k_{21}a_1+k_{22}a_2+k_{23}a_3+\dots+k_{2n}a_n$$

$$\dots$$

$$x_{(n-1)}=k_{(n-1)1}a_1+k_{(n-1)2}a_2+k_{(n-1)3}a_3+\dots+k_{(n-1)n}a_n$$

as there are $n-1$ different independent comparisons possible with
n observations. The important parts of these formulae are the
k's, the coefficients by which the a values are multiplied to give
the comparison used. Thus in the simple case of two observations
we have seen that the contribution to the sum of squares is
based on $(a_1-a_2)$. Then $k_{11}=1$ and $k_{12}=-1$. Taking the case of
three observations, one possible method of partition is into
components based on $(a_1-a_2)$ and $(a_1+a_2-2a_3)$. Here we have

$$k_{11}=1,\quad k_{12}=-1,\quad k_{13}=0$$

and

$$k_{21}=1,\quad k_{22}=1,\quad k_{23}=-2$$




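With four observations, where the partition is on the basis of
$x_1=a_1-a_2$, $x_2=a_1+a_2-2a_3$ and $x_3=a_1+a_2+a_3-3a_4$, the k values are

$$k_{11}=1\quad k_{12}=-1\quad k_{13}=0\quad k_{14}=0$$

$$k_{21}=1\quad k_{22}=1\quad k_{23}=-2\quad k_{24}=0$$

$$k_{31}=1\quad k_{32}=1\quad k_{33}=1\quad k_{34}=-3$$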


One characteristic property of the k values is immediately
obvious, viz. that in any comparison they add up to 0, or

$$S(k)=0$$

This is easily shown for the three comparisons of the preceding
paragraph.

1st comparison $S(k)=1-1+0+0=0$

2nd comparison $S(k)=1+1-2+0=0$

3rd comparison $S(k)=1+1+1-3=0$

The second characteristic is not so obvious. It relates to
the cross products of k's in pairs of comparisons. Now it has
continually been emphasized that the various comparisons of a
set must be independent of one another or, as it is often called,
orthogonal, if a successful partition is to be achieved; and it
will be remembered (Section 20) that the test of independence
of any two quantities is that their cross product should be zero
within the limits of sampling error. The way in which this
cross-product test is applied to a pair of x functions (or comparisons)
is to multiply the two k's, one from each function,
associated with $a_1$, also the two associated with $a_2$, and so on,
and add the products together. If the functions are orthogonal
this sum is 0.

Taking the three functions used above for showing that $S(k)=0$,
the k values give as their cross products

$x_1$ and $x_2$: $k_{11}k_{21}=1$, $k_{12}k_{22}=-1$, $k_{13}k_{23}=0$, $k_{14}k_{24}=0$; $S(k_1k_2)=0$

$x_1$ and $x_3$: $k_{11}k_{31}=1$, $k_{12}k_{32}=-1$, $k_{13}k_{33}=0$, $k_{14}k_{34}=0$; $S(k_1k_3)=0$

$x_2$ and $x_3$: $k_{21}k_{31}=1$, $k_{22}k_{32}=1$, $k_{23}k_{33}=-2$, $k_{24}k_{34}=0$; $S(k_2k_3)=0$

The sums of all three cross products are 0 and the comparisons 
are independent. Such a set of independent comparisons is often 
spoken of as a set of orthogonal functions. 

The comparison term of the sum of squares formula has
occupied our attention so far, but there still remains the question
of the initial fraction to be considered. This, too, is dependent
on the k values of the orthogonal functions. Turning once more
to the three comparisons

$$(a_1-a_2),\quad (a_1+a_2-2a_3),\quad (a_1+a_2+a_3-3a_4)$$

between four observations, we have, it will be remembered,
$\tfrac{1}{2}$, $\tfrac{1}{6}$ and $\tfrac{1}{12}$ as initial fractions to the formulae. In every
case the numerator is 1 and the denominator is $S(k^2)$. In $x_1$,
$k_{11}=1$, $k_{12}=-1$, $k_{13}=k_{14}=0$ and $S(k^2)=1+1+0+0=2$, so giving $\tfrac{1}{2}$ as
the fraction. In $x_2$, $k_{21}=k_{22}=1$, $k_{23}=-2$, $k_{24}=0$ so that $S(k^2)=
1+1+4+0=6$. Similarly for the third component $k_{31}=k_{32}=k_{33}=1$,
$k_{34}=-3$ and $S(k^2)=1+1+1+9=12$, the initial fraction being $\tfrac{1}{12}$.
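
The three properties of the k values lend themselves to a direct check. The sketch below is illustrative only: the function names `contrast_ss` and `is_orthogonal_set` are hypothetical, and the four observations are arbitrary numbers rather than data from the text.

```python
from itertools import combinations

def contrast_ss(k, a):
    """Sum of squares for one comparison: (S(k*a))^2 / S(k^2)."""
    x = sum(ki * ai for ki, ai in zip(k, a))
    return x * x / sum(ki * ki for ki in k)

def is_orthogonal_set(ks):
    """True when S(k) = 0 for each comparison and S(k_i * k_j) = 0 for each pair."""
    sums_zero = all(sum(k) == 0 for k in ks)
    cross_zero = all(
        sum(p * q for p, q in zip(k1, k2)) == 0
        for k1, k2 in combinations(ks, 2)
    )
    return sums_zero and cross_zero

# k values of the three comparisons x1, x2, x3 between four observations.
ks = [(1, -1, 0, 0), (1, 1, -2, 0), (1, 1, 1, -3)]

# Arbitrary illustrative observations (an assumption, not from the text).
a = [3.0, 7.0, 4.0, 10.0]

divisors = [sum(ki * ki for ki in k) for k in ks]   # S(k^2) for each comparison
components = [contrast_ss(k, a) for k in ks]
total = sum(x * x for x in a) - sum(a) ** 2 / len(a)

print("orthogonal:", is_orthogonal_set(ks))
print("divisors S(k^2):", divisors)
print("components:", components, "sum =", sum(components), "total =", total)
```

The divisors reproduce the initial fractions 1/2, 1/6 and 1/12, and the three components add up to the total sum of squares.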

24. THE ANALYSIS OF VARIANCE 

We have now found the three characteristics of the components
of a sum of squares, viz. (a) $S(k)=0$ for each comparison,
(b) $S(k_1k_2)=0$ for all pairs of comparisons, and (c) $1/S(k^2)$ is the
initial fraction of any component of sum of squares. By using
these characteristics it is possible to partition the sum of squares
in any way which seems appropriate to the analysis in hand.

It is clear, however, that although the separation of every
individual degree of freedom is possible, such a complete partition
is seldom necessary. As a case in point, consider an experiment
of the type discussed in Section 21. The problem was, it will
be remembered, that of testing a possible difference in yield
between two varieties of tomato, when the different plants of
each variety grow in positions of varying fertility.

As a simple case for discussion, consider two plants of each 
variety. There will be four plants in all, that of variety A in 
position 1 and that of variety A in position 2, that of variety B in 
position 1 and that of variety B in position 2. The initial, or 
null, hypothesis is that all varieties have the same yield, and 
we wish to determine whether the observed results agree with 
this view, within the limits of sampling error. A low probability 
will indicate that sampling error is incapable of explaining the 
departures from equality of yield, and in consequence one type 
must be supposed to be more prolific than the other. 

The yields of the four individual plants will be denoted as
$a_{A1}$, $a_{A2}$, $a_{B1}$ and $a_{B2}$, where the subscripts A and B refer to variety
and 1 and 2 to position. There are 3 degrees of freedom between
the four observations. Of these, 2 can be related to comparisons
of obvious interest, viz. the differences between plants of the two
varieties and between plants in the two different positions. The
former is a comparison of A with B and so must be based on the
expression $(a_{A1}+a_{A2})-(a_{B1}+a_{B2})$. Similarly the position comparison
will be $(a_{A1}+a_{B1})-(a_{A2}+a_{B2})$. The third degree of freedom necessary
for the completion of the analysis can be found by the
methods of Section 22 to be $(a_{A1}+a_{B2})-(a_{A2}+a_{B1})$. It will be
observed that these constitute a set of orthogonal functions as
$S(k)=0$ for each expression and $S(k_1k_2)=0$ for each pair of expressions.
Furthermore, each comparison gives $S(k^2)=4$ and so the
initial fraction will in all cases be $\tfrac{1}{4}$. The three contributions to
the sum of squares are thus:

(i) $\tfrac{1}{4}(a_{A1}+a_{A2}-a_{B1}-a_{B2})^2$

(ii) $\tfrac{1}{4}(a_{A1}-a_{A2}+a_{B1}-a_{B2})^2$

(iii) $\tfrac{1}{4}(a_{A1}-a_{A2}-a_{B1}+a_{B2})^2$

Now these three comparisons are of very different importance
for our purpose. The first one obviously depends on the difference
whose significance is our central problem. The second one
concerns the positional difference in fertility and hence is irrelevant.
The third one, like the first, is of importance but in
a different way. It will be observed that just as the first comparison
may be written $(a_{A1}-a_{B1})+(a_{A2}-a_{B2})$, the third one can be
made to take the form $(a_{A1}-a_{B1})-(a_{A2}-a_{B2})$. So the first represents
the total difference in varietal yield while the third is the difference
of this varietal difference in the two positions. It is in fact
a measure of the variation in the varietal difference, and hence
gives information necessary for the calculation of the mean
square measuring uncontrolled variation, i.e. the error mean
square with which the mean square between varieties must be
compared in order that its probability may be assessed.

When there are three plants of each variety the analysis is a
simple extension of that given above. There are 5 degrees of
freedom to be accounted for. One is obviously dependent on the
total varietal difference, and is based on the comparison

$$(a_{A1}+a_{A2}+a_{A3})-(a_{B1}+a_{B2}+a_{B3})$$

As there are now three positions, two degrees of freedom are
assignable to the comparisons between them. A typical partition
would be into the elements

$$(a_{A1}+a_{B1})-(a_{A2}+a_{B2})$$

and

$$(a_{A1}+a_{B1})+(a_{A2}+a_{B2})-2(a_{A3}+a_{B3}),$$

but the precise method of decomposition is not important since
these comparisons are only picked out in order that they may
be rejected as irrelevant to our purpose. The two remaining
degrees of freedom are $(a_{A1}-a_{B1})-(a_{A2}-a_{B2})$ and
$(a_{A1}-a_{B1})+(a_{A2}-a_{B2})-2(a_{A3}-a_{B3})$
if the position comparisons are partitioned as above. All comparisons
satisfy the criteria of orthogonal functions. The calculation
of $S(k^2)$ for each of them shows that the contributions
to the sum of squares will be:

Varieties: $\tfrac{1}{6}[(a_{A1}-a_{B1})+(a_{A2}-a_{B2})+(a_{A3}-a_{B3})]^2$

Positions: $\tfrac{1}{4}[(a_{A1}+a_{B1})-(a_{A2}+a_{B2})]^2$
           $\tfrac{1}{12}[(a_{A1}+a_{B1})+(a_{A2}+a_{B2})-2(a_{A3}+a_{B3})]^2$

Error:     $\tfrac{1}{4}[(a_{A1}-a_{B1})-(a_{A2}-a_{B2})]^2$
           $\tfrac{1}{12}[(a_{A1}-a_{B1})+(a_{A2}-a_{B2})-2(a_{A3}-a_{B3})]^2$



As in the simpler example, there are three types of comparisons,
and these three have exactly the same importance as
before for our purpose. The position comparisons may be 
pooled and rejected together. In just the same way the two 
error comparisons can and should be used jointly in estimating 
the sampling error to which the varietal difference is subject. 
So even if the position comparisons are found separately they 
will ultimately be summed, and even if the error comparisons 
are found separately they will be summed too. Hence a direct 
method of finding the joint sum of squares would be advantageous 
in each case. 

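Taking the two position comparisons first, it will be seen that

$$\tfrac{1}{4}[(a_{A1}+a_{B1})-(a_{A2}+a_{B2})]^2+\tfrac{1}{12}[(a_{A1}+a_{B1})+(a_{A2}+a_{B2})-2(a_{A3}+a_{B3})]^2$$

$$=\tfrac{1}{12}[3(a_{A1}+a_{B1})^2+3(a_{A2}+a_{B2})^2-6(a_{A1}+a_{B1})(a_{A2}+a_{B2})]$$

$$\quad+\tfrac{1}{12}[(a_{A1}+a_{B1})^2+(a_{A2}+a_{B2})^2+4(a_{A3}+a_{B3})^2+2(a_{A1}+a_{B1})(a_{A2}+a_{B2})$$

$$\qquad-4(a_{A1}+a_{B1})(a_{A3}+a_{B3})-4(a_{A2}+a_{B2})(a_{A3}+a_{B3})]$$

$$=\tfrac{1}{12}[4(a_{A1}+a_{B1})^2+4(a_{A2}+a_{B2})^2+4(a_{A3}+a_{B3})^2-4(a_{A1}+a_{B1})(a_{A2}+a_{B2})$$

$$\qquad-4(a_{A1}+a_{B1})(a_{A3}+a_{B3})-4(a_{A2}+a_{B2})(a_{A3}+a_{B3})]$$

$$=\tfrac{1}{12}[6(a_{A1}+a_{B1})^2+6(a_{A2}+a_{B2})^2+6(a_{A3}+a_{B3})^2]-\tfrac{1}{12}[2(a_{A1}+a_{B1})^2$$

$$\qquad+2(a_{A2}+a_{B2})^2+2(a_{A3}+a_{B3})^2+4(a_{A1}+a_{B1})(a_{A2}+a_{B2})+4(a_{A1}+a_{B1})(a_{A3}+a_{B3})$$

$$\qquad+4(a_{A2}+a_{B2})(a_{A3}+a_{B3})]$$

$$=\tfrac{1}{2}[(a_{A1}+a_{B1})^2+(a_{A2}+a_{B2})^2+(a_{A3}+a_{B3})^2]-\tfrac{1}{6}(a_{A1}+a_{B1}+a_{A2}+a_{B2}+a_{A3}+a_{B3})^2$$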


This is the form of the familiar calculation of the sum of squares
from a series of observed values, except that there is a fraction
$\tfrac{1}{2}$ in the first term. Such a divisor is characteristically the
number of observations summed to give the item which is squared.
In the present case each position total is obtained by summing
the yields of two plants. Hence to find the sum of squares for
all the position comparisons taken together, the three position
totals are squared, divided by 2 and summed. The correction
term, $\tfrac{1}{n}S^2(a)$, is then deducted. In the present case there are
six observations in all, and so n is 6 in the calculation of this
correction. It will be seen that this agrees with the general rule
about divisors, viz. that any square is divided by the number
of observations which have been summed to give the item which
is squared.

In the same way it can be shown that the total sum of squares
of the error comparisons is given by

$$\tfrac{1}{2}[(a_{A1}-a_{B1})^2+(a_{A2}-a_{B2})^2+(a_{A3}-a_{B3})^2]-\tfrac{1}{6}[a_{A1}-a_{B1}+a_{A2}-a_{B2}+a_{A3}-a_{B3}]^2$$

This resembles the sum of squares for position except that
$(a_{A1}-a_{B1})$ replaces $(a_{A1}+a_{B1})$, and so on. Here the correction
term is itself the sum of squares for varietal difference.
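
The agreement between the pooled formulae and the separate single-degree-of-freedom comparisons can be illustrated numerically. This is a sketch under stated assumptions: the six yields are hypothetical numbers chosen only for illustration (they are not the tomato data), and all variable names are ours.

```python
# Hypothetical yields, for illustration only: one plant of each variety
# (A and B) in each of three positions.
A = [1.4, 1.6, 1.1]
B = [1.2, 1.5, 0.8]

n_pos = len(A)
n_obs = 2 * n_pos
pos_totals = [x + y for x, y in zip(A, B)]   # position totals a_Ai + a_Bi
diffs = [x - y for x, y in zip(A, B)]        # varietal differences a_Ai - a_Bi
grand = sum(A) + sum(B)

# Pooled calculations, as in the text.
ss_varieties = sum(diffs) ** 2 / n_obs
ss_positions = sum(t * t for t in pos_totals) / 2 - grand ** 2 / n_obs
ss_error = sum(d * d for d in diffs) / 2 - ss_varieties

# The separate single-degree-of-freedom components.
p1 = (pos_totals[0] - pos_totals[1]) ** 2 / 4
p2 = (pos_totals[0] + pos_totals[1] - 2 * pos_totals[2]) ** 2 / 12
e1 = (diffs[0] - diffs[1]) ** 2 / 4
e2 = (diffs[0] + diffs[1] - 2 * diffs[2]) ** 2 / 12

# Total sum of squares from the six observations directly.
ss_total = sum(v * v for v in A + B) - grand ** 2 / n_obs

print("positions:", ss_positions, "=", p1 + p2)
print("error    :", ss_error, "=", e1 + e2)
print("total    :", ss_total, "=", ss_varieties + ss_positions + ss_error)
```

Each pooled sum of squares equals the sum of its two components, and the three items together account for the whole of the total sum of squares.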





In the tomato experiment described in Example 6, plants of
each of two varieties were grown in ten positions. There are
thus 19 degrees of freedom in all. By analogy with the simpler
cases already considered, these 19 degrees of freedom are subdivisible
into three groups, (a) 1 for the comparison between
varieties, (b) 9 for differences between the fertilities of the
different positions and (c) 9 for variation of the varietal difference
in the ten positions, i.e. for error. The calculation of the sums
of squares appropriate to each of these groups can be made by
extending that used in the case of three positions. The data are
set out for the calculation in Table 8.

First of all the comparison of the yields of the two varieties
will be $(a_{A1}+a_{A2}+\dots+a_{A10})-(a_{B1}+a_{B2}+\dots+a_{B10})$. This gives
$S(k^2)=20$, and so the sum of squares is

$$\tfrac{1}{20}[(a_{A1}+a_{A2}+\dots+a_{A10})-(a_{B1}+a_{B2}+\dots+a_{B10})]^2$$

Arithmetically this becomes $\tfrac{1}{20}(13.253-11.367)^2=0.177850$. We
may note in passing that this value could have been found in
a somewhat different way analogous to that used for the position
total. Putting $a_{A1}+a_{A2}+\dots+a_{A10}=a_A$ and
$a_{B1}+a_{B2}+\dots+a_{B10}=a_B$
we can rewrite the formula for the sum of squares, for

$$\tfrac{1}{20}(a_A-a_B)^2=\tfrac{1}{10}(a_A^2+a_B^2)-\tfrac{1}{20}(a_A+a_B)^2$$

It will be seen that the correction term is the same as that used
in the calculation of the position sum of squares. The first term
consists of the sum of squares of the two varietal totals divided
by the number of plants on which each total is based, i.e. 10.
Arithmetically this is

$$\tfrac{1}{10}(13.253^2+11.367^2)-\tfrac{1}{20}(24.620)^2=30.485070-30.307220=0.177850$$

as before.

The sum of squares for position is found from the summed 
yields of each position. These numbers are squared and their 
squares summed. Each is based on two plant yields, and so the 
sum of squares will be divided by 2. The correction term is as 
above, viz. the square of the total yield divided by the number 
of plants. So arithmetically we find the position sum of squares 
to be 

$$\tfrac{1}{2}(2.408^2+2.624^2+\dots+2.850^2)-\tfrac{1}{20}(24.620)^2=32.089979-30.307220=1.782759$$

Lastly, we must calculate the sum of squares due to positional
variation in the varietal difference. As already shown, this
calculation is just the same as that for positions, but the differences
between the yields of the two plants in each position are
used instead of their sums. So the sum of squares is

$$\tfrac{1}{2}(0.342^2+0.190^2+\dots+0.416^2)-\tfrac{1}{20}(1.886)^2=0.340875-0.177850=0.163025.$$

Table 10 gives these results in the form of what is termed
an analysis of variance. The sums of squares are found in the
second column, while the third column gives the numbers of
degrees of freedom (N). The ratio of a sum of squares to its
corresponding number of degrees of freedom gives the mean
square as shown in the fourth column. It will be recalled that
a mean square is an estimated variance, and the term is extensively
used in connexion with the type of analysis now under discussion.


TABLE 10

Analysis of Variance of Tomato Yields

Item                         Sum of      Degrees of   Mean       t       Probability
                             Squares     Freedom      Square
                             (S.S.)      (N)
Varieties                    0.177850     1           0.177850   3.133   0.02-0.01
Positions                    1.782759     9           0.198084
Varieties-Positions (Error)  0.163025     9           0.018114
Total                        2.123634    19


At this stage the test of significance, towards which the whole
analysis has been leading, becomes possible. We are interested
in the varietal difference whose mean square is based on one
comparison, so a t test is suitable for our purpose. The denominator
of t is clearly to be found by taking the square root of the
error mean square. Then

$$t=\sqrt{\frac{0.177850}{0.018114}}=\sqrt{9.8184}=3.133$$

The probability of finding such a poor fit by chance is very small
(0.02-0.01). The varieties must be supposed to have genuinely
different yields.
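
The arithmetic of Table 10 and of the t test can be retraced in a few lines. This is a minimal sketch using only the sums of squares and degrees of freedom quoted above; the dictionary names are ours.

```python
import math

# Sums of squares and degrees of freedom as given in Table 10.
ss = {"varieties": 0.177850, "positions": 1.782759, "error": 0.163025}
df = {"varieties": 1, "positions": 9, "error": 9}

# Mean square = sum of squares / degrees of freedom.
ms = {item: ss[item] / df[item] for item in ss}

# t for the single varietal comparison: root of the ratio of the
# varietal mean square to the error mean square.
t = math.sqrt(ms["varieties"] / ms["error"])

for item in ss:
    print(f"{item:10s} S.S. = {ss[item]:.6f}  N = {df[item]}  M.S. = {ms[item]:.6f}")
print(f"total S.S. = {sum(ss.values()):.6f} on {sum(df.values())} degrees of freedom")
print(f"t = {t:.3f} on {df['error']} degrees of freedom")
```

The probability itself is then read from a table of t with 9 degrees of freedom, as in the text.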

The value of t found by the analysis of variance is exactly the
same as that found by a slightly different method in Example 6;
though it will be observed that the two root mean squares whose
ratio gives t are different in the two calculations. In Section 21
the numerator of t was found as the difference between the mean
yields of the varieties, i.e. as $\tfrac{1}{10}(a_A-a_B)$, where $a_A$ is the summed
yield of A plants and $a_B$ that of B plants. Similarly the denominator
of t was $\sqrt{\tfrac{1}{90}S(a_A-a_B)^2}$, where $S(a_A-a_B)^2$ is the sum
of squares of the difference in yields of A and B plants in
corresponding positions. The analysis of variance provides
$\sqrt{\tfrac{1}{20}(a_A-a_B)^2}$ as the numerator of t and $\sqrt{\tfrac{1}{18}S(a_A-a_B)^2}$ as its
denominator. Thus both numerator and denominator as found
in Section 21 have $1/\sqrt{5}$ times the value of those given by the
analysis of variance. The two methods of analysis give results
differing by a constant fraction which cancels out when the test
of significance is performed.

Before leaving this example a further computational point
may be noted. There are 20 observed yields and a total sum of
squares for 19 degrees of freedom may be found directly from
these 20 values. This is clearly the total sum of squares of the
whole experiment and so represents the number which has been
partitioned in the analysis. Therefore, when correctly calculated,
the three components already found should agree with this total
on summing. Arithmetically the total sum of squares is
$(1.375^2+1.407^2+\dots+1.223^2+1.633^2+1.033^2+1.217^2+\dots+1.252^2
+1.217^2)-\tfrac{1}{20}(24.620)^2=2.123634$, which tallies with the total of the
three components. Thus we can determine any one of the three
components by finding the total and subtracting from it the
other two component items; this is the method normally used
for the calculation of the error sum of squares, which is not
always so easy to find directly as in the present example.


25. INTERACTIONS BETWEEN MAIN EFFECTS 

Two main effects were recognized in the tomato example. 
These were the difference between the varieties and the differences 
due to position in the greenhouse. In addition, a third term 
appeared in the analysis, depending on the effect of position on 
the varietal difference. This item was used as an estimate of 
error, in that it was taken as a measure of the fluctuation of 
yield differences due to factors over which no experimental 
control could be exercised, though of course it is a valid estimate 
only if the effects of position and variety are additive. In such 
a case the varietal difference in any position is independent of 
the position effect. 

If there were any other method of estimating the error
variance in the tomato experiment it would be possible to use
the third term of the analysis for the specific purpose of testing
the additive nature of the two main effects, or finding out whether
they 'interacted' to a significant extent. In general, terms of
this kind are recognized in the analysis of variance and are called
interactions. The one under consideration would be described
as a first-order interaction between variety and position.

As the number of recognizable main effects increases the
number of interactions increases even more rapidly, and some of
them will be of an order higher than the one in the tomato
analysis. Suppose, for example, we have an experiment in
which observations are made on eight individuals, the first of
which had been subjected to treatments A, B, and C, the second
to A and B, the third to A and C, the fourth to B and C, the
fifth to A alone, the sixth to B alone, the seventh to C alone,
and the eighth to no treatment at all. There are three main
effects due to the three treatments A, B, and C. Their sums of
squares will be derived from the comparisons

$$(a_{ABC}+a_{AB}+a_{AC}+a_A)-(a_{BC}+a_B+a_C+a_1)$$

$$(a_{ABC}+a_{AB}+a_{BC}+a_B)-(a_{AC}+a_A+a_C+a_1)$$

$$(a_{ABC}+a_{AC}+a_{BC}+a_C)-(a_{AB}+a_A+a_B+a_1)$$

where $a_{ABC}$ represents the observation made on the individual
who receives all three treatments, and so on, $a_1$ being that from
the untreated individual. There are 4 more degrees of freedom
to be accounted for, and these will be interactions of various
sorts. They can be assigned to particular comparisons in the
following way, which is due to R. A. Fisher.

The treatment A is denoted by that letter while its absence
is denoted by 1. Similar symbolism is used for the other two
treatments, so that any combination of the three may be noted
by a corresponding combination of A, B, C, and 1. The k coefficients
assigned to the eight individual observations in the
comparison giving the main effect of A can then be found
by expanding the expression $(A-1)(B+1)(C+1)$ into the form
$ABC+AB+AC+A-BC-B-C-1$ as already used. The main effects
of B and C are similarly found from the expansions of
$(A+1)(B-1)(C+1)$ and $(A+1)(B+1)(C-1)$ respectively.

It will be seen that the factor whose effect is being considered
is represented by a bracket containing a difference, while those
whose effects are not under consideration are used as sums. It
is not difficult to see how the expressions giving the interaction
comparisons are found. A first-order interaction depends on the
interplay of two main effects, the third being left out of consideration.
The corresponding expression thus contains two
brackets with differences and one with a sum. The first-order
interaction of A and B will then be the expansion of

$$(A-1)(B-1)(C+1)\quad\text{giving}\quad ABC+AB+C+1-AC-BC-A-B$$

In this case $k_{ABC}=k_{AB}=k_C=k_1=1$ and $k_{AC}=k_{BC}=k_A=k_B=-1$. The
other two first-order interactions are $(A-1)(B+1)(C-1)$ and
$(A+1)(B-1)(C-1)$ respectively.

The seventh degree of freedom is then seen to be a second-order
interaction of A, B, and C, and will be derivable from the
expression $(A-1)(B-1)(C-1)$. For this comparison
$k_{ABC}=k_A=k_B=k_C=1$ and $k_{AB}=k_{AC}=k_{BC}=k_1=-1$.

The k coefficients obtained in this way are set out in Table 11,
from which it is readily shown that the comparisons are orthogonal
and that the divisor is 8 in each case.


TABLE 11

k Coefficients of 7 Orthogonal Functions for the Analysis of
3 Treatments each at 2 Levels

                               ABC  AB  AC  BC   A   B   C   1   S(k)  S(k^2)
Main effects              A     1    1   1  -1   1  -1  -1  -1    0      8
                          B     1    1  -1   1  -1   1  -1  -1    0      8
                          C     1   -1   1   1  -1  -1   1  -1    0      8
1st-order interactions    AB    1    1  -1  -1  -1  -1   1   1    0      8
                          AC    1   -1   1  -1  -1   1  -1   1    0      8
                          BC    1   -1  -1   1   1  -1  -1   1    0      8
2nd-order interaction     ABC   1   -1  -1  -1   1   1   1  -1    0      8

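
The expansion rule for the k coefficients of Table 11 is mechanical enough to be written out as a short program. The following is only a sketch: the function name `coefficients` and the string encoding of treatment combinations are our own conventions, not notation from the text.

```python
from itertools import combinations

FACTORS = "ABC"
# Treatment combinations in the column order of Table 11 ("1" = untreated).
COMBOS = ["ABC", "AB", "AC", "BC", "A", "B", "C", "1"]
EFFECTS = ["A", "B", "C", "AB", "AC", "BC", "ABC"]

def coefficients(effect):
    """k values from expanding a product of three brackets, using (X-1) for
    each factor named in `effect` and (X+1) for each of the others."""
    ks = []
    for combo in COMBOS:
        k = 1
        for f in FACTORS:
            # A factor absent from the combination contributes the second
            # term of its bracket: -1 from a difference, +1 from a sum.
            if f in effect and f not in combo:
                k = -k
        ks.append(k)
    return ks

table = {effect: coefficients(effect) for effect in EFFECTS}

for effect, ks in table.items():
    print(f"{effect:4s}", ks, "S(k) =", sum(ks), "S(k^2) =", sum(k * k for k in ks))

# Orthogonality: the cross products of every pair of comparisons sum to 0.
all_orthogonal = all(
    sum(p * q for p, q in zip(table[e1], table[e2])) == 0
    for e1, e2 in combinations(EFFECTS, 2)
)
print("all pairs orthogonal:", all_orthogonal)
```

Each row sums to 0, has a divisor of 8, and is orthogonal to every other row.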


Though these coefficients are not applicable to cases with
more than 1 degree of freedom for each main effect and interaction,
the method of calculation of the sum of squares corresponding
to the various comparisons is suggested by these
formulae. The main effect of A will be obtainable from the
totals found by summing over the various B and C classes. The
corresponding C classes will be summed for the determination of
the interaction between A and B. This will leave a two-way
table having three kinds of degrees of freedom, one being the
main A effect obtainable from one margin, a second kind from
the other margin being the main B effect, and lastly a third type
for interaction of A and B. This last group can be found as the
difference of the total sum of squares of the table and the two
main effects calculated from the margins.

The application of these methods of calculation to a more
complex case may be illustrated by the following results taken
from an analysis of barley yields, in bushels per three acres,
published by Immer, Hayes and Powers.

Example 7. Data on the yields of five varieties when grown
at each of six places in the State of Minnesota during the years
1931 and 1932 are given in Table 12. There are 5x6x2 observations
in all, and so the total analysis will contain 59 degrees of
freedom. The first task is that of partitioning these degrees of
freedom into the components appropriate to the various comparisons
which might be interesting.


TABLE 12

Barley Yields in Bushels per Three Acres (Immer, Hayes, and Powers)

                               Varieties
Place   Year    Manchuria   Svansota   Velvet   Trebi   Peatland
1       1931       81.0       105.4     119.7   109.7     98.3
        1932       80.7        82.3      80.4    87.2     84.2
2       1931      146.6       142.0     150.7   191.5    145.7
        1932      100.4       115.5     112.2   147.7    108.1
3       1931       82.3        77.3      78.4   131.3     89.6
        1932      103.1       105.1     116.5   139.9    129.6
4       1931      119.8       121.4     124.0   140.8    124.8
        1932       98.9        61.9      96.2   125.5     75.7
5       1931       98.9        89.0      69.1    89.3    104.1
        1932       66.4        49.9      96.7    61.9     80.3
6       1931       86.9        77.1      78.9   101.8     96.0
        1932       67.7        66.7      67.4    91.8     94.1



We first of all note that summation of the yields of each
variety over places and year leaves five varietal totals, which
will supply 4 degrees of freedom for varietal differences. A
similar procedure, but summing over varieties and years, gives
six place totals with 5 degrees of freedom for differences between
places. Finally, summing over varieties and places gives two
annual totals with 1 degree of freedom for the difference between
the two years. So 10 degrees of freedom are accounted for by
the main effects.

To obtain the material for calculating the first-order interaction
of varieties and places, it is necessary to sum over years.
Adding the 1931 and 1932 yields of each variety in each place
leaves a table with thirty entries (5x6). Of the 29 degrees of
freedom which it contains, 9 are assignable to main effects, viz.
4 to varieties and 5 to places. These are found from the row
and column totals in the two margins of the table. The remaining
20 degrees of freedom, obtained by subtracting these 9 from the
total of 29, relate to the first-order interaction under consideration.

A 5x2 table is obtained when summation is made over places.
Of the 9 degrees of freedom which this contains, 4 are attributable
to varietal differences and 1 to years, leaving 4 for the first-order
interaction of variety and year. Finally, with summation over
varieties a 6x2 table containing 11 degrees of freedom is left.
Deduction of the 5 place and 1 year degrees of freedom gives
5 for interaction of place and time. The three first-order interactions
have together taken up 29 degrees of freedom. With
the 10 for main effects this leaves 20 to be accounted for by the
second-order interaction of variety with time and place.

There is a very simple rule for determining the number of
degrees of freedom appertaining to any interaction. The varieties
take 4 and the places 5 degrees of freedom. Then their interaction
takes 4x5. Similarly, years take 1, so the variety-year
interaction will have 4 and the place-year interaction 5 degrees
of freedom. The second-order interaction has 4x5x1 or 20 degrees
of freedom assignable to it.
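
The rule can be stated compactly: the degrees of freedom of an interaction are the product of those of the main effects concerned. A minimal sketch for the barley experiment follows; the dictionary layout and the labels joined with " x " are our own conventions.

```python
from itertools import combinations
from math import prod

# Number of classes for each main effect in the barley example.
levels = {"varieties": 5, "places": 6, "years": 2}

# Each main effect has (number of classes - 1) degrees of freedom.
main_df = {name: n - 1 for name, n in levels.items()}

# Main effects (order 1) and interactions (orders 2 and 3): multiply the
# degrees of freedom of the main effects that enter into the item.
df = {}
for order in range(1, len(levels) + 1):
    for combo in combinations(levels, order):
        df[" x ".join(combo)] = prod(main_df[name] for name in combo)

total_df = prod(levels.values()) - 1   # 60 observations give 59 degrees of freedom

for item, n in df.items():
    print(f"{item:30s} {n:3d}")
print(f"{'total':30s} {total_df:3d}")
```

The seven items together use up all 59 degrees of freedom, which is a convenient check on the partition.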

So we reach the analysis of degrees of freedom shown in 
Table 15. The calculation of the corresponding sums of squares 
proceeds along much the same lines. 

TABLE 13

A. Varietal Total Yields of Barley

Manchuria   Svansota   Velvet    Trebi     Peatland   Total
1,132.7     1,093.6    1,190.2   1,418.4   1,230.5    6,065.4

B. Place Total Yields of Barley

1       2         3         4         5       6       Total
928.9   1,360.4   1,053.1   1,089.0   805.6   828.4   6,065.4

C. Year Total Yields of Barley

1931      1932      Total
3,271.4   2,794.0   6,065.4

The varietal sum of squares is found by summation over places
and years (Table 13A). Each total is composed of 12 values,
and so after squaring and summing the result must be reduced
to $\tfrac{1}{12}$. The term correcting for the use of 0 as a working mean
is, of course, found by dividing the square of the grand total by
the number of observations, viz. 60. Thus the sum of squares
for varieties is:

$$\tfrac{1}{12}(1{,}132.7^2+1{,}093.6^2+1{,}190.2^2+1{,}418.4^2+1{,}230.5^2)-\tfrac{1}{60}(6{,}065.4)^2$$

or 5,309.9723 for 4 degrees of freedom.

Table 13B shows the place totals obtained by summing over
varieties and years. Each total comprises 10 values, and so the
divisor of the squared values will be 10. The correction term is
as before. This gives

$$\tfrac{1}{10}(928.9^2+1{,}360.4^2+\dots+828.4^2)-\tfrac{1}{60}(6{,}065.4)^2=21{,}220.9040$$

as the sum of squares in question. Table 13C supplies the material
for calculating the "years" sum of squares. The divisor is
6x5, i.e. 30, and using the same correction term the result of the
calculation is 3,798.5126.





TABLE 14

A. Variety-Place Classification of Yield in Barley

                                 Variety
Place           Manchuria   Svansota   Velvet    Trebi     Peatland   Place total
1                 161.7       187.7     200.1     196.9     182.5        928.9
2                 247.0       257.5     262.9     339.2     253.8      1,360.4
3                 185.4       182.4     194.9     271.2     219.2      1,053.1
4                 218.7       183.3     220.2     266.3     200.5      1,089.0
5                 165.3       138.9     165.8     151.2     184.4        805.6
6                 154.6       143.8     146.3     193.6     190.1        828.4
Variety total   1,132.7     1,093.6   1,190.2   1,418.4   1,230.5      6,065.4

B. Variety-Year Classification of Yield in Barley

                                 Variety
Year            Manchuria   Svansota   Velvet    Trebi     Peatland   Year total
1931              615.5       612.2     620.8     764.4     658.5      3,271.4
1932              517.2       481.4     569.4     654.0     572.0      2,794.0
Variety total   1,132.7     1,093.6   1,190.2   1,418.4   1,230.5      6,065.4

C. Place-Year Classification of Yield in Barley

                                 Place
Year            1       2         3         4         5       6       Year total
1931            514.1   776.5     458.9     630.8     450.4   440.7   3,271.4
1932            414.8   583.9     594.2     458.2     355.2   387.7   2,794.0
Place total     928.9   1,360.4   1,053.1   1,089.0   805.6   828.4   6,065.4



Turning next to the first-order interactions between varieties
and places, we sum over years to get Table 14A, in which each
entry is the sum of two observed values. The sum of squares
calculated from this table corresponds to 29 degrees of freedom.
The divisor is 2 and the correction term as before; so the sum
of squares is

$$\tfrac{1}{2}(161.7^2+247.0^2+\dots+154.6^2+187.7^2+\dots+190.1^2)-\tfrac{1}{60}(6{,}065.4)^2,$$

i.e. 30,963.8940






















But this includes the variety and place main effects as well as
their interaction. Deducting the main effect sums of squares,
as already found, leaves 30,963.8940-5,309.9723-21,220.9040, i.e.
4,433.0177 for the interaction.

The variety-year interaction is found in the same way from
Table 14B. The divisor is 6, as summation has been over
6 places. The total sum of squares is found to be

$$\tfrac{1}{6}(615.5^2+517.2^2+612.2^2+\dots+572.0^2)-\tfrac{1}{60}(6{,}065.4)^2$$

from which deduction of the two main effect items, for varieties
and years, leaves 291.8124 for the interaction.

The last interaction of this order, that between places and
years, is found from Table 14C, in which each entry is the sum of
5 items. The divisor is 5 and the sum of squares 31,913.3180.
Subtraction of the place and year sums of squares gives 6,893.9014
for the interaction.

The full table of 60 items has 59 degrees of freedom
corresponding to a sum of squares of

$$(81.0^2 + 80.7^2 + \dots + 67.7^2 + 105.4^2 + \dots + 94.1^2) - \tfrac{1}{60}(6{,}065.4)^2,$$

i.e. 44,732.3540.

The main effects and first-order interactions between them have
been shown to take up 5,309.9723 + 21,220.9040 + 3,798.5126
+ 4,433.0177 + 291.8124 + 6,893.9014, or 41,948.1204 of this total,
leaving the difference of 2,784.2336 for the 20 degrees of freedom
of the second-order variety-place-year interaction.
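The subtraction may be laid out as a small piece of bookkeeping. The following Python sketch merely re-traces the figures quoted above; the dictionary layout and names are assumptions made for illustration.

```python
# Numbers of levels: 5 varieties, 6 places, 2 years.
v, p, y = 5, 6, 2

# item: (degrees of freedom, sum of squares as quoted in the text)
items = {
    "varieties":     (v - 1,             5309.9723),
    "places":        (p - 1,             21220.9040),
    "years":         (y - 1,             3798.5126),
    "variety-place": ((v - 1) * (p - 1), 4433.0177),
    "variety-year":  ((v - 1) * (y - 1), 291.8124),
    "place-year":    ((p - 1) * (y - 1), 6893.9014),
}
total_df, total_ss = v * p * y - 1, 44732.3540

used_df = sum(df for df, _ in items.values())
used_ss = sum(ss for _, ss in items.values())

# Whatever is left belongs to the second-order interaction.
residual_df = total_df - used_df
residual_ss = total_ss - used_ss
residual_ms = residual_ss / residual_df

print(residual_df, round(residual_ss, 4), round(residual_ms, 4))
```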

The various sums of squares found in this way are entered
opposite their degrees of freedom in Table 15 and the division
of each sum of squares by the number of degrees of freedom gives
the mean square or variance appropriate to each of the sets of
comparisons.

This method of analysis, of both the degrees of freedom and
the sums of squares, can obviously be extended to more complex
cases. The main effects are found first by means of the various
one-way tables. The first-order interactions are obtained from
the two-way tables by subtraction of the main effects. The
three-way tables give second-order interactions after the
corresponding main and first-order deductions have been made.
Four-way tables contain main effects and first- and second-order
interactions together with a third-order type which will be left
when the other items found from the lower order tables have
been subtracted. Five-way and higher tables are analysed by
extension of this process.
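The extension to any number of classifications is easily mechanised so far as the degrees of freedom are concerned. The sketch below, in Python, assumes a complete classification with one observation per cell; the function name `degrees_of_freedom` is hypothetical and chosen only for this illustration.

```python
from itertools import combinations
from math import prod


def degrees_of_freedom(levels):
    """Return {tuple of factor names: degrees of freedom} for every main
    effect and interaction of a complete classification."""
    table = {}
    names = list(levels)
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            # An interaction takes the product of (levels - 1) over its factors.
            table[combo] = prod(levels[name] - 1 for name in combo)
    return table


barley = degrees_of_freedom({"variety": 5, "place": 6, "year": 2})
for combo, df in barley.items():
    print("-".join(combo), df)

# The items together exhaust the total degrees of freedom (60 - 1 = 59).
assert sum(barley.values()) == 5 * 6 * 2 - 1
```

A fourth classification is handled by adding one more entry to the dictionary, when the third-order item appears of itself.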

TABLE 15

Item                   Degrees of freedom   Sum of squares   Mean square

Varieties                       4              5,309.9723     1,327.4931
Places                          5             21,220.9040     4,244.1808
Years                           1              3,798.5126     3,798.5126
Variety-place                  20              4,433.0177       221.6509
Variety-year                    4                291.8124        72.9531
Place-year                      5              6,893.9014     1,378.7803
Variety-place-year             20              2,784.2336       139.2117

Total                          59             44,732.3540

[The variance ratio and probability columns of this table are
illegible in the source.]

The question of the significance of the various items in
Table 15 remains. The second-order interaction is the most
appropriate measure of sampling error in this case, it being usual
to take the highest order interaction for this purpose, unless some
other is specially indicated by the results. This has a mean square
of 139.2117 based on 20 degrees of freedom. Only the main
effect of years may be tested by the calculation of a t, as this
is the sole item based on 1 degree of freedom. This component
has a mean square of 3,798.5126, giving

$$t = \sqrt{\frac{3{,}798.5126}{139.2117}} = \sqrt{27.286} = 5.224$$

with a probability of less than 0.001. Such a departure cannot
reasonably be attributed to sampling error, and it must be
supposed that the barley gave different yields in the two years.

The other items are tested by means of z or the variance
ratio. The varieties comparisons give a variance ratio of
9.536 for $N_1 = 4$ and $N_2 = 20$. ($N_1$ = degrees of freedom of
larger variance, $N_2$ of smaller variance, in this case the error
mean square.) This has a probability of less than 0.001, so showing
that there are real differences in yield between the varieties.
The other variance ratios and their probabilities are given in the
table of analysis of variance. The 6 different places at which
tests were conducted are clearly shown to have different fertilities
and they also differ in their reactions to the varying climatic
conditions of the two years, as the significant place-year
interaction shows. The interactions of varieties with places and
years may reasonably be attributed to sampling error; indeed, the
latter has a mean square smaller than that which is being used
as the estimate of sampling error. No variance ratio is given in
this case. It cannot be calculated in a way parallel to that of
the other items because in determining a z or variance ratio the
smaller mean square must always be used as the denominator.
(See Section 16.)
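The variance ratios follow directly from the sums of squares already found. The Python sketch below recomputes them; it does not look up probabilities, for which the tables of z or of the variance ratio remain necessary, and its names are illustrative assumptions.

```python
from math import sqrt

# Error: the second-order interaction, 20 degrees of freedom.
error_ms = 2784.2336 / 20

# item: (sum of squares, degrees of freedom) as found in the text
items = {
    "varieties":     (5309.9723, 4),
    "places":        (21220.9040, 5),
    "years":         (3798.5126, 1),
    "variety-place": (4433.0177, 20),
    "variety-year":  (291.8124, 4),
    "place-year":    (6893.9014, 5),
}

ratios = {}
for name, (ss, df) in items.items():
    ms = ss / df
    # Following the text, the smaller mean square is always the denominator,
    # so no ratio on the error is formed for an item falling below the error.
    ratios[name] = ms / error_ms if ms >= error_ms else None

# For 1 degree of freedom t is the square root of the variance ratio.
t_years = sqrt(ratios["years"])

for name, ratio in ratios.items():
    print(name, None if ratio is None else round(ratio, 3))
print("t for years:", round(t_years, 3))
```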

As these two interactions, variety-place and variety-year, are 
not significant, they could be combined with the second-order 
interaction to give a joint estimate of error, which would, of 
course, be known with greater accuracy in that it would be based 
on 44 degrees of freedom. Division gives this new error mean 
square as 170.6605. In the present case there is little to be
gained by this procedure, as the increase in precision is small and 
is offset by an increase in the mean square. Where, however, 
a smaller number of degrees of freedom is concerned, this practice 
may be of great value. 
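The pooling amounts to adding the sums of squares and the degrees of freedom separately before dividing, as the following short Python sketch shows (the variable names are illustrative).

```python
# (sum of squares, degrees of freedom) of the three items to be pooled
second_order = (2784.2336, 20)
variety_place = (4433.0177, 20)
variety_year = (291.8124, 4)

pooled_ss = second_order[0] + variety_place[0] + variety_year[0]
pooled_df = second_order[1] + variety_place[1] + variety_year[1]
pooled_ms = pooled_ss / pooled_df  # the joint estimate of error

print(pooled_df, round(pooled_ms, 4))
```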

Finally we may consider the question of the standard error
of the various yields. The sampling error variance per plot is
139.2117, so that the standard error of a single plot yield is
$\sqrt{139.2117}$, i.e. 11.7992 bushels per acre. The variance of the
total yield of a variety based on 12 plots is 139.2117 × 12, i.e.
1,670.5404, and the standard error is 40.8718. The total yield
of the variety Manchuria is, for example, 1,132.7 ± 40.8718.
The mean yield of each variety will have a variance of
$\frac{139.2117}{12}$ or 11.6010, and a standard error of 3.4060. The
mean yield of Manchuria is thus 94.3917 ± 3.4060 bushels per acre.
Lastly, the variance of the difference between two means is
11.6010 + 11.6010 or 23.2020, the standard error being 4.8169. So
Velvet and Trebi show a difference of 19.0167 ± 4.8169, a difference
which clearly cannot be attributed to sampling error and must be
judged real.

Standard errors can be found in the same way for the place
and year means. The mean yield of the 1931 crop is, for example,
based on 30 plots. So its variance will be $\frac{139.2117}{30}$ or
4.6404, and the standard error 2.1542. The variance of the
difference between the means of the 1931 and 1932 crops would then
be 4.6404 × 2, i.e. 9.2808, and the standard error 3.0464.
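The standard errors of means and of differences between means all derive from the one error variance, as the following Python sketch illustrates. The two helper functions are hypothetical names introduced here; the variety totals are those quoted for Velvet and Trebi.

```python
from math import sqrt

error_variance = 139.2117  # sampling error variance per plot


def se_of_mean(n_plots):
    """Standard error of a mean of n_plots independent plot yields."""
    return sqrt(error_variance / n_plots)


def se_of_difference(n_plots):
    """Standard error of the difference between two such means."""
    return sqrt(2 * error_variance / n_plots)


variety_mean_se = se_of_mean(12)        # each variety mean rests on 12 plots
variety_diff_se = se_of_difference(12)
year_mean_se = se_of_mean(30)           # each year mean rests on 30 plots
year_diff_se = se_of_difference(30)

# Trebi against Velvet, from the variety totals 1,418.4 and 1,190.2.
difference = (1418.4 - 1190.2) / 12

print(round(variety_mean_se, 3), round(variety_diff_se, 3))
print(round(year_mean_se, 3), round(year_diff_se, 3))
print(round(difference, 4), round(difference / variety_diff_se, 2))
```

The last figure printed is the difference expressed in units of its standard error.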


26. INCOMPLETE ANALYSIS

In the two examples of analysis of variance so far considered,
classification of the data was complete. The tomatoes, for
example, could be classified for variety no matter what position
they were grown in, and the position was classifiable no matter
what variety was placed in it. In the same way every barley
observation could be assigned to its proper variety, place, and
year. Such complete classification is, however, not always
possible.

Suppose, for example, that twenty plants, ten of each variety,
A and B, were grown in no special order in a greenhouse. There
would be 20 yields recorded, giving 19 degrees of freedom, of
which one would be assignable to the varietal difference; but the
remaining 18 would not be further subdivisible, because there
would be no correspondence in position between the plants of
the different varieties. There would be no way of separating
the main effect of position from the variety-position interaction.
The analysis of variance would be incomplete owing to the
shortcomings of the design and classification.

This particular example of incomplete analysis is perhaps
artificial, but somewhat similar situations are often met with in
practice. In such cases it is necessary, though not always easy,
to ascertain the extent to which the analysis is incomplete, by
determining which classes, theoretically distinguishable, must be
grouped as a result of the limitations of classification. An
example of this is provided by Mather and Dobzhansky's data on
the number of teeth in the sex-combs of the male Drosophila
pseudo-obscura.

Example 8. This species of Drosophila comprises two races
which give nearly sterile hybrids on inter-crossing. A number
of strains of each race were examined in order to determine
whether they showed any characteristic morphological differences.
In particular, counts were made of the number of teeth in the
proximal sex-comb of the male. The experiment under
consideration included 4 strains of each race, A and B, and all the
strains were raised at two different temperatures, viz. 17.5° C.
and 24.5° C. Twenty-five males of each strain raised at each
temperature were counted, the results being set out in Table 16.
There were 25 × 2 × 4 × 2 or 400 counts in all, giving a total of
399 degrees of freedom. How far can this be analysed?

Let us first consider what the complete classification would
have given (see Table 17). There are 2 races, and hence 1 degree
of freedom for racial differences. The 4 strains contribute
3 degrees of freedom for strain differences, and as 4 strains of
each race were raised, there will be 3 degrees of freedom for
race-strain interaction. Two temperatures were used, so giving
1 degree of freedom for temperature effects, 1 degree of freedom
for race-temperature interaction, 3 degrees of freedom for
strain-temperature interaction, and 3 degrees of freedom for the
second-order interaction between races, strains and temperatures.
Lastly, there are 25 individuals with 24 degrees of freedom for
differences between individuals, 24 for race-individual interaction,
72 for strain-individual interaction, 72 for race-strain-individual
interaction, 24 for temperature-individual interaction, 72 for
strain-temperature-individual interaction, 24 for
race-temperature-individual interaction, and finally, 72 for the
third-order race-strain-temperature-individual interaction. It is,
however, clear that many of these categories cannot be
distinguished.

There are four strains of each race, but there is no way of
linking up the strains of opposite races into what might be
termed homologous pairs. Each strain is unique. Hence of the
7 degrees of freedom from the 8 strains, 1 will be assignable to
races, but the remaining 6 must be grouped together as strain
differences. The analysis cannot separate the strain main effect
from the race-strain interaction. Turning to temperature
differences, the main effect can be picked out, as can the
interaction of temperature with race; but there will be a group of
6 degrees of freedom for strain-temperature interaction which in
the complete analysis would fall into two parts, viz.
strain-temperature first-order interaction and
race-strain-temperature second-order interaction, each with
3 degrees of freedom.

TABLE 16

Number of Teeth in the Proximal Sex-combs of Drosophila
pseudo-obscura Males (Mather and Dobzhansky)

[Table body illegible in the source.]

Lastly, there are obviously no relations between the
individuals of the different samples, in the way that a plant of
variety A in a given position can be related to a plant of variety
B in the same, rather than in any other, position. It is quite
clear that the only comparison for which any individual can be 
used is that made with the mean of the sample in which this 
individual is contained. Hence the 25 individuals of each sample 
will contribute 24 degrees of freedom to a general pool. This 
pool must include the degrees of freedom for the main comparison 
of individuals and all the interactions of individuals with the 
other main effects as classified in the complete analysis. Thus 
the estimate of variance due to sampling error in this analysis 
will be of a highly composite nature. The only possible
subdivision of this group would be into 16 sub-groups each from
a different sample ; but unless there is reason to suspect an effect 
of race, strain or temperature on individual variability, such a 
subdivision would carry no advantage. 
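The grouping just described can be set out as a tally. In the Python sketch below the complete classification is written down item by item and then pooled wherever the design gives no means of separation; the labels and the helper `pooled_label` are illustrative assumptions.

```python
# Degrees of freedom of the complete classification (2 races, 4 strains,
# 2 temperatures, 25 individuals), as enumerated in the text.
complete = {
    "race": 1, "strain": 3, "race-strain": 3,
    "temperature": 1, "race-temperature": 1,
    "strain-temperature": 3, "race-strain-temperature": 3,
    "individual": 24, "race-individual": 24,
    "strain-individual": 72, "race-strain-individual": 72,
    "temperature-individual": 24, "strain-temperature-individual": 72,
    "race-temperature-individual": 24,
    "race-strain-temperature-individual": 72,
}


def pooled_label(item):
    """Name of the group into which an item falls in the incomplete analysis."""
    if "individual" in item:
        return "error"  # individuals within samples
    if item in ("strain", "race-strain"):
        return "strains"
    if item in ("strain-temperature", "race-strain-temperature"):
        return "strain-temperature"
    return item


incomplete = {}
for item, df in complete.items():
    label = pooled_label(item)
    incomplete[label] = incomplete.get(label, 0) + df

for label, df in incomplete.items():
    print(label, df)
```

The error item of 384 degrees of freedom is the same as 24 from each of the 16 samples.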

Turning next to the calculation of the sums of squares it
should be noted that, as before, the method of partitioning the
degrees of freedom points the way to the analysis of the sum of
squares. First of all there is the race comparison. Race A has
in all 1,335 teeth and race B 1,119 (see Table 16). Each total
is the sum of 200 items, the grand total of 2,454 teeth being
based on 400 flies. Hence the sum of squares for race is

$$\tfrac{1}{200}(1{,}335^2 + 1{,}119^2) - \tfrac{1}{400}(2{,}454^2) = 15{,}171.93 - 15{,}055.29 = 116.64$$

The main effect of temperature is similarly

$$\tfrac{1}{200}(1{,}276^2 + 1{,}178^2) - \tfrac{1}{400}(2{,}454^2) = 24.01$$

Next, summation over strains and individuals gives a 2 × 2 table
of temperature by race from which the total sum of squares for
3 degrees of freedom is

$$\tfrac{1}{100}(695^2 + 640^2 + 581^2 + 538^2) - \tfrac{1}{400}(2{,}454^2) = 141.01$$

Subtraction of the main temperature and race effects, as already
found, then leaves 0.36 for the race-temperature interaction.
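These three items can be reproduced from the four cell totals alone. In the following Python sketch the assignment of the larger total of each race to 17.5° C. is an inference from the marginal totals and from the strain figures quoted in the text, not something read from the table itself; the names are illustrative.

```python
# Totals of teeth in the 2 x 2 table of race by temperature;
# each cell is the sum of 100 flies (4 strains x 25 males).
# Temperature labels are inferred (see the lead-in).
totals = {
    ("A", 17.5): 695, ("A", 24.5): 640,
    ("B", 17.5): 581, ("B", 24.5): 538,
}
flies_per_cell = 100
grand = sum(totals.values())                    # 2,454
correction = grand ** 2 / (4 * flies_per_cell)  # 2,454^2 / 400


def margin(index):
    """Marginal totals over key position `index` (0 = race, 1 = temperature)."""
    out = {}
    for key, value in totals.items():
        out[key[index]] = out.get(key[index], 0) + value
    return out


race_totals = margin(0)  # 1,335 and 1,119
temp_totals = margin(1)  # 1,276 and 1,178

ss_race = sum(t ** 2 for t in race_totals.values()) / 200 - correction
ss_temp = sum(t ** 2 for t in temp_totals.values()) / 200 - correction
ss_cells = sum(t ** 2 for t in totals.values()) / 100 - correction
ss_race_temp = ss_cells - ss_race - ss_temp

print(round(ss_race, 2), round(ss_temp, 2),
      round(ss_cells, 2), round(ss_race_temp, 2))
```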

The eight strain totals, summed over individuals and
temperatures, are each the sum of 50 observations. Hence the
7 degrees of freedom between them will have a sum of squares of

$$\tfrac{1}{50}(337^2 + 313^2 + \dots + 337^2 + 292^2 + \dots + 248^2) - \tfrac{1}{400}(2{,}454)^2$$

or 158.39. But this will include the main item for race, deduction
of which leaves 41.75 for the 6 degrees of freedom depending on
strain differences. We have now accounted for 9 of the 15 degrees
of freedom between the 16 sample totals. The remaining 6 form
the strain-temperature interaction item. The sum of squares
can be found in either of two ways. It may be obtained directly
by the calculation of the sum of squares for the 15 comparisons
between sample totals and the subsequent deduction of the four
items already calculated. The sample totals give a sum of
squares of

$$\tfrac{1}{25}(169^2 + 165^2 + \dots + 149^2 + \dots + 134^2 + 168^2 + \dots + 114^2) - \frac{2{,}454^2}{400} = 189.59$$

from which the race, strain, temperature, and race-temperature
items, 116.64, 41.75, 24.01, and 0.36, are deducted, to leave
6.83 for the strain-temperature interaction.

The second way of finding the interaction in question is based
on the use of the eight differences in teeth number of the eight
strains grown at two temperatures. Thus at 17.5° C., Wawona-6
has 169 teeth, while at 24.5° C. it has 168 teeth, so giving a
difference of 1 tooth. The other similar items are shown in the
last column of Table 16. Each figure is based on 50 flies, being
the difference between two samples of 25, and so the total sum
of squares for the eight comparisons is

$$\tfrac{1}{50}(1^2 + 17^2 + \dots + 17^2 + 6^2 + \dots + 20^2) = 31.2$$

No correction term is involved when such differences are squared
(see Section 24). These 8 comparisons include the main
temperature effect and all the temperature-race and
temperature-strain interactions. The first of the items may be
found by squaring the total difference, i.e. 98 teeth, and dividing
the square by 400, to give 24.01 as already obtained in a different
way. We next use the racial totals of the differences, viz. 55 and
43, each based on 200 flies, to give $\tfrac{1}{200}(55^2 + 43^2)$ or
24.37 as the sum of the temperature and race-temperature effects.
Deduction of the temperature effect leaves 0.36, as previously
found, for the race-temperature interaction. Deduction of the
temperature and race-temperature items from the total of 31.2
leaves 6.83 for strain-temperature interaction.
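The method of differences may be followed through in a few lines of Python. Only the racial totals and the quoted figure of 31.2 are used, since the eight separate strain differences are not all given in the text; the variable names are illustrative.

```python
# Temperature differences of the race totals, each difference resting
# on 200 flies (100 at each temperature).
race_differences = {"A": 695 - 640, "B": 581 - 538}  # 55 and 43
flies_per_difference = 200

# Squared differences need no correction term.
ss_temp_and_race_temp = (
    sum(d ** 2 for d in race_differences.values()) / flies_per_difference
)

total_difference = sum(race_differences.values())  # 98 teeth, on 400 flies
ss_temp = total_difference ** 2 / 400
ss_race_temp = ss_temp_and_race_temp - ss_temp

# The eight strain differences give 31.2 in all (figure quoted from the
# text), so the strain-temperature interaction is what remains.
ss_all_differences = 31.2
ss_strain_temp = ss_all_differences - ss_temp - ss_race_temp

print(round(ss_temp, 2), round(ss_race_temp, 2), round(ss_strain_temp, 2))
```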

The last sum of squares to be found is that corresponding to
the individual or error item of 384 degrees of freedom. It is
most easily found as a difference. The sum of squares of all
400 single counts is $15{,}339 - \tfrac{1}{400}(2{,}454)^2$, i.e. 283.71.
Of the 399 degrees of freedom on which this total is based, 15 are
accounted for by differences between sample totals. The
remaining 384 form the item under consideration. These 15 degrees
of freedom have already been shown to take a sum of squares
of 189.59, so leaving 283.71 − 189.59, i.e. 94.12 for the error item.

The analysis of variance is set out in Table 17, the mean
squares having been found as usual by dividing the degrees of
freedom into the sums of squares. In calculating the variance
ratios given in the column next to the mean squares, the error
or individual mean square was used as the denominator. A
t for 384 degrees of freedom may be used for the race, temperature
and race-temperature items, and such a t is for all practical
purposes a normal deviate (c). The interaction of race and
temperature may safely be attributed to sampling error, though
both main effects are clearly significant on this test.

TABLE 17

Analysis of Variance of Teeth in Drosophila Sex-combs

Item                   Degrees of freedom   Sum of squares   Mean square

Race                            1               116.64         116.6400
Strains                         6                41.75           6.9583
Temperature                     1                24.01          24.0100
Race-temperature                1                 0.36           0.3600
Strain-temperature              6                 6.83           1.1383
Error (individuals)           384                94.12           0.2451

Total                         399               283.71

[The variance ratio and probability columns of this table are
illegible in the source.]

Variance Ratio 1 is obtained when the error term is used.
Variance Ratio 2 is obtained by comparison with some appropriate
term other than error.
* A Variance Ratio for 1 degree of freedom = t².

An examination of the strain and strain-temperature
interaction puts a somewhat different light on these results,
however. These two items are tested by the use of z or variance
ratio tables. It is found that both are highly significant, having
a probability of less than 0.001. (It may be noted that in using
the variance ratio table 384 degrees can be considered to be ∞.)
Now if there is a large strain effect, the race difference could be
significant when tested against the error variance and yet still
be imaginary in the sense that it was simply a reflexion of the
large strain differences. So the individual mean square must be
rejected as an estimate of the error to which race totals are
subject and replaced by the strain mean square. When this is
done a t of 4.095 for 6 degrees of freedom is obtained and found
to have a probability of 0.01-0.001. So even this more stringent
test shows that the racial difference is real and not merely an
outcome of variation between strains of a race.

Similarly, the race-temperature interaction should be
compared with the strain-temperature interaction. The former is
much lower than the latter and so the question arises as to
whether it is significantly lower. The t of 0.562 has, however,
a probability of less than 0.70 and there is no reason to suspect
a subnormal value.
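The effect of changing the error term is readily seen when the two tests are placed side by side. The Python sketch below recomputes the values of t from the sums of squares of the analysis; it gives no probabilities, and its names are illustrative.

```python
from math import sqrt

# (sum of squares, degrees of freedom) from the analysis in the text
race = (116.64, 1)
strains = (41.75, 6)
race_temp = (0.36, 1)
strain_temp = (6.83, 6)
error = (94.12, 384)


def mean_square(item):
    """Sum of squares divided by its degrees of freedom."""
    ss, df = item
    return ss / df


# Race tested against individuals, and then against strains.
t_race_vs_error = sqrt(mean_square(race) / mean_square(error))
t_race_vs_strains = sqrt(mean_square(race) / mean_square(strains))

# Race-temperature tested against strain-temperature interaction.
t_race_temp = sqrt(mean_square(race_temp) / mean_square(strain_temp))

print(round(t_race_vs_error, 2), round(t_race_vs_strains, 2),
      round(t_race_temp, 3))
```

The first value, judged on 384 degrees of freedom, is several times the second, which must be judged on only 6.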

So the main conclusions derived from the analysis are
unambiguous. Strains differ from one another in tooth number
and also in their reactions to temperature change. Races also
differ from one another, and to a greater degree than do strains;
but they do not show any interaction with temperature. The
choice of the proper error variance is essential to the rigorous
testing of these various items. Failure to make use of the proper
variance could result in seriously faulty conclusions.

REFERENCES

FISHER, R. A. 1937. The Design of Experiments. Oliver and Boyd.
Edinburgh. 2nd ed.

FISHER, R. A., and YATES, F. 1943. Statistical Tables for Biological,
Agricultural and Medical Research. Oliver and Boyd. Edinburgh.
2nd ed.

IMMER, F. R., HAYES, H. K., and POWERS, L. B. 1934. Statistical
determination of barley varietal adaptation. J. Amer. Soc. Agron.,
26, 403-19.

MATHER, K., and DOBZHANSKY, TH. 1939. Morphological differences
between the 'races' of Drosophila pseudo-obscura. Amer. Nat.,
73, 5-25.



CHAPTER VII 


PLANNING EXPERIMENTS


27. THE FACTORIAL EXPERIMENT 


THE development of the analysis of variance has profoundly 
affected the planning of experiments, for two reasons. In the 
first place the full advantage of this technique can only be 
obtained if the data fulfil certain requirements ; and secondly, 
more complex and informative experiments are made possible 
by the separation of particular comparisons in the analysis of 
variance. The new technique of experimentation has been 
developed and discussed at length by Fisher in his book The 
Design of Experiments. His main principles may be illustrated 
quite simply. 

Suppose it is desired to test the effect on growth of feeding 
two substances, A and B, to some animal, say the rat. The 
traditional way of doing the experiment would be to divide the 
available rats into three groups, using one, to which neither 
A nor B was fed, as a control, and giving one of the remaining 
two groups the control diet with the addition of A and the other 
the control diet with the addition of B. Since it would be 
expected that the individual response to any treatment would 
vary, large numbers of rats would be used in each group and 
some precaution might be taken with a view to equalizing the 
distribution of animals of the various ages and sizes among the 
groups. A likely variant of this procedure would be to investigate
the effects of A and B in two separate experiments. A control
group would be used on each occasion so that twice as many
rats would receive this treatment as either of the others. 

The logic of this technique is not difficult to follow.
Uncontrollable variation will be encountered and so everything must
be done to make the effects of A and B, respectively, as striking
as possible in order that the error variation will not obscure
them. Thus it would be bad policy to mix up A and B as the
former treatment would increase the apparent variability of
the B experiment and make its effect more difficult to establish,
and vice versa. Then again, large numbers of rats would
be used without paying too much regard to their origin, as
increase in numbers reduces the variability of the mean response.
Lastly, the animals might be assigned to the treatments in some
particular way to reduce the error to a minimum. Thus the
emphasis is laid throughout on the reduction of variability by
whatever means are available. It would probably also be
maintained that the A and B tests must be separated, since a joint
test would give no means of determining their individual effects.

Fisher's approach differs in two fundamental respects. In
the first place he lays stress on the estimation of the
uncontrollable, or error, variation rather than on its minimization;
though any device, and he has developed several, which reduces
this error without invalidating the measurement of the residuum,
is to be adopted. Secondly, he emphasizes the benefits of
including as many as possible of the factors, whose effects are to
be determined, in a single experiment. There are three such
advantages, viz. (a) greater precision of comparison is obtained
for the expenditure of a given amount of labour or space, (b)
information is obtained on more points than the older type of
experiment can possibly give, and (c) there is a wider inductive
basis for any conclusion which might be reached as a consequence
of the experiment.

Let us consider these various points in more detail. First of 
all there is the question of error variation. Once it has been 
shown that any error variation occurs at all, the problem of 
testing whether any apparent effect of a specific treatment is 
real or only illusory, in that it arises from the error variation, 
reduces to that of deciding whether the differences could
reasonably be attributed to random fluctuation or not. In other words,
a test of significance, of the kind we have discussed earlier, is 
involved ; and such a test of significance demands the use of an 
unbiased estimate of sampling error variation. No matter how 
large or small the error may be, it can be used if, and only if, 
its magnitude is known. Any device which reduces the error 
variation is clearly of value, but it must not invalidate the 
estimate of residual error. The older idea of stressing the
necessity of reducing error variation to a minimum clearly assumes
the necessity for taking this variance into account when judging 
the effect of a treatment; but it misses the real requirement of 
providing a means of making this judgement, viz. the necessity 
for the provision of an estimate of the error’s magnitude. 

Next we may turn to the method of applying the treatments
in an experiment. Fisher advocates the use of what he terms
the factorial experiment, i.e. the division of the material into
a sufficiently large number of groups for every combination of
treatments to be applied to some fixed number of these groups.
Certain restrictions of the combinations may be imposed without
destroying the factorial nature of the experiment, as will be
seen later, but in principle all combinations of treatments are
used.

In the rat experiment, detailed above, there are four
combinations of treatments, A and B used together, A alone, B alone,
and O, i.e. no treatment. So at least four groups of experimental
animals would be required, though, of course, any multiple of
four would do equally well. One of the four groups would
receive each of the four treatment combinations. The effect of
A will be found as (AB − B) + (A − O), where AB is the effect of
the double treatment, A that of A alone, B that of B alone
and O that of using neither substance. The B comparison would
similarly be (AB − A) + (B − O). These are clearly comparisons
which can be used in the analysis of variance, provided that
the number of animals assigned to each group is equal. It
might be added that such equality in numbers is presupposed
in the whole design, as otherwise the comparisons given would
obviously fail to measure the effect desired. When regarded
from the standpoint of the analysis of variance it is clear
that we can also take out an interaction comparison (AB − A − B + O)
which is independent of the A and B items and gives information
as to the additive nature of the two effects.
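The three comparisons may be illustrated numerically. In the Python sketch below the group totals are invented purely for the purpose and are not taken from any experiment; the check at the end shows that the three single degrees of freedom together account for the whole sum of squares between the four groups.

```python
# Hypothetical totals of weight gain for the four treatment combinations,
# with equal numbers of animals in every group (figures invented).
animals_per_group = 6
totals = {"O": 120.0, "A": 150.0, "B": 138.0, "AB": 174.0}

effect_A = (totals["AB"] - totals["B"]) + (totals["A"] - totals["O"])
effect_B = (totals["AB"] - totals["A"]) + (totals["B"] - totals["O"])
interaction = totals["AB"] - totals["A"] - totals["B"] + totals["O"]

# Each comparison sets two groups against two, so it carries 1 degree of
# freedom and its sum of squares is (comparison)^2 / (number of animals).
n_animals = 4 * animals_per_group
ss = {
    "A": effect_A ** 2 / n_animals,
    "B": effect_B ** 2 / n_animals,
    "AB": interaction ** 2 / n_animals,
}

# Sum of squares between the four groups, found in the usual way.
correction = sum(totals.values()) ** 2 / n_animals
ss_between = sum(t ** 2 for t in totals.values()) / animals_per_group - correction

print(effect_A, effect_B, interaction)
print(ss, round(ss_between, 4))
```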

Every animal is used in the formulation of each of these
three comparisons, so that the experiment has full efficiency for
each of the main effects and also for their interaction. This is
in marked contradistinction to the type of experiment outlined

earlier, which has at most only ⅔ efficiency for the main effects
of A and B and supplies no information on their interaction. 
It might be noted that as the number of treatments increases,
more and more information is to be extracted from the factorial
experiment, as more and more interactions are calculable. The
older experiment, on the other hand, gives less and less
information, as more groups must be sacrificed for new treatments.
In fact, the case of two treatments shows this type of experiment
at its best, and how good that best is when compared with
Fisher's factorial design has already been seen.
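The relative efficiency of the designs can be reckoned by comparing the variance of the A comparison when the same number of animals is available to each. The Python sketch below is an illustration under stated assumptions, not a calculation given in the text: the error variance per animal is taken as the same throughout, the groups are of equal size, and the number of animals is hypothetical.

```python
from fractions import Fraction


def variance_of_difference(n_first, n_second):
    """Variance of the difference between two means, in units of the
    error variance per animal."""
    return Fraction(1, n_first) + Fraction(1, n_second)


N = 120  # hypothetical number of animals available

# Factorial: half the animals receive A and half do not.
factorial = variance_of_difference(N // 2, N // 2)

# Three groups (control, A, B): A is compared with the control only.
three_groups = variance_of_difference(N // 3, N // 3)

# Two separate experiments, each with its own control: four groups in all.
two_experiments = variance_of_difference(N // 4, N // 4)

# Efficiency relative to the factorial design is the ratio of variances.
print(factorial / three_groups, factorial / two_experiments)
```

On this reckoning the single experiment with three groups is the more favourable form of the older design, the pair of separate experiments doing worse.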

The information about treatment interactions is of immense
importance in exploratory experiments where no earlier
information is available. In later experiments, when interactions
may be known to be of less importance, they can be used
for the purpose of still further increasing the precision of the
main comparisons. It will also be seen that by using all
combinations of treatments the effects of A and B are tested in
a wider variety of circumstances and so the experiment provides
a broader inductive basis for any conclusions which may be
drawn. Any statements about the effect of, say, A will take
into account its interaction with B and with C, D, etc.,
if more treatments are used.

The factorial experiment is intimately bound up with the
analysis of variance; indeed, the whole success of the factorial
design depends on the separation of comparisons in the way
made possible by the analysis of variance. This potentiality
may also be used for the partial control of error. Suppose, for
example, that, in the rat experiment, animals were available
from a number of litters. There is likely to be more variation
in response to treatment between animals from different litters
than between animals of the same litter. Now if, say, four
animals are used from each litter, one receiving each of the four
treatment combinations, a series of comparisons may be isolated
in the analysis of variance for the differences between litter
responses. These will probably have larger mean squares than
the remaining comparisons within litters and the error variation
will be correspondingly reduced.

Consideration of a number of actual experiments carried out
along these lines will show how the principles of the factorial 
design are applied in practice. 

28. AN EXPERIMENT WITH THREE FACTORS

Example 9. The following results are taken from a larger
experiment on the effect of environment on callus formation in 
apple cuttings, conducted by Shippy. The whole experiment 
was not in fact completely factorial though the portion used here 
is of this kind. 

The principal object of the experiment was to discover whether 
temperature had any effect on the rate of callus formation, but 
two other variables could conveniently be introduced, viz. (a) the 
difference between scion cuttings of the variety Yellow Transparent
and seedlings of a certain winter hardy northern strain
and (b) the time factor in callus formation. Both kinds of 
material were cut back and their callus allowed to develop at 
20° C. and at 32° C., the growth being determined at intervals 
of 6 and 7 days after cutting. The amount of growth was 
measured as the diameter, in units of one-quarter millimeter, of 
the callus roll on lip, base, and sides of the cut surface. These 
measurements require a considerable time to make, and so it 
was impossible to use more than one cutting of each kind at 
each temperature. To increase the number of observations, two 
series were started, the second a few days after the first. 

Eight cuttings were used in all, four in each of the two series. 
Two in each series were Yellow Transparent scions, while the 
other two were hardy stock cuttings, and these were kept one 
at each temperature. Each cutting had its callus measured 
twice, once after 6 days and once after 7 days, so there were 
16 observations in all, as set out in Table 18. 

There are 15 degrees of freedom in the whole analysis, and
these can be subdivided into components testing various effects. 

If we sum like observations of the two series there will result
eight totals separable on the basis of (a) temperature, (b) variety,
and (c) time of observation. Each variety is observed at each
temperature at each time. So there will be four totals (i.e. eight
observations) at each temperature and the temperature totals
will be independent of both variety and time (and also, of course,
of series), since all combinations of variety and time are included
equally, viz. twice, in each temperature. Similarly four totals
are recorded for each variety, and the varietal effect is
independent of temperature and time. Finally a time comparison can
be made independently of variety and temperature in the same
way.


TABLE 18

The Growth of Callus by Apple Cuttings (Shippy)

                            Temperature and variety
Time of observation         20° C.            32° C.
and series               Scion   Stock     Scion   Stock     Total

Series 1   6 days           3       3         9       7        22
           7 days           9       9        19       7        44
Series 2   6 days           3       2         7       3        15
           7 days           8       5         5       3        21

Total                      23      19        40      20       102

Scion = Yellow Transparent. Stock = Hardy northern strain.


There are two varieties, two temperatures, and two times, so 
that these three main effects will each take 1 degree of freedom. 
The remaining 4, of the 7 degrees of freedom between the eight 
treatment totals, are ascribable to interactions, one of the first 
order between temperature and varieties, one of the first order 
between temperatures and times, one of the first order between 
varieties and times and one of the second order between all 
three treatments. 

So 7 of the 15 degrees of freedom are accounted for by treat¬ 
ment comparisons of various kinds. The remaining 8 depend 
on differences between the duplicate observations of eight treat¬ 
ment combinations in the two series. These can be profitably 
subdivided still further. One is assignable to the total differ¬ 
ence between series 1 and series 2, being independent of treat¬ 
ments since these were equally represented in the two series.
The remaining 7 are interactions of the treatment effects with
the series comparison. They need not be isolated, but can be 
jointly used to give an estimate of error variation. 










It should be noticed that this analysis has depended through¬ 
out on the factorial nature of the experiment. The effect of 
temperature, for example, can be separated from those of variety 
and time only because each temperature includes exactly equal 
numbers of observations on each variety at each time. For
every observation at 20° C. there is one of an exactly similar
nature at 32° C. If one or more observations were omitted, or 
replaced by others of a different status with respect to variety 
or time or both, this balance would be upset and the comparisons 
would no longer be independent. Thus the experiment must be 
carried out with scrupulous care or its whole value will be for¬ 
feited. If the various treatment comparisons are tested by the 
method of Section 23 it will be seen that they answer to all the 
requirements of orthogonality if, but only if, the factorial design 
is adopted. 
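The dependence of orthogonality on the balanced factorial layout can be checked numerically. The sketch below assumes that the test of Section 23 amounts to verifying that the products of the coefficients of any two comparisons sum to zero; the -1/+1 coding of the two levels of each factor and the helper names are illustrative choices, not part of the original analysis.

```python
from itertools import combinations, product

FACTORS = ("T", "V", "D")  # temperature, variety, time, each coded -1/+1
ITEMS = [c for r in (1, 2, 3) for c in combinations(range(3), r)]  # 7 comparisons


def coefficients(cells):
    """Coefficient of every treatment combination in each of the 7 comparisons."""
    table = {}
    for item in ITEMS:
        name = "".join(FACTORS[i] for i in item)
        coefs = []
        for cell in cells:
            k = 1
            for i in item:
                k *= cell[i]
            coefs.append(k)
        table[name] = coefs
    return table


def mutually_orthogonal(cells):
    """True if the coefficient products sum to zero for every pair of comparisons."""
    table = coefficients(cells)
    return all(
        sum(a * b for a, b in zip(table[p], table[q])) == 0
        for p, q in combinations(table, 2)
    )


full_design = list(product((-1, 1), repeat=3))  # all 8 treatment combinations
damaged = full_design[:-1]                      # one combination omitted

print(mutually_orthogonal(full_design))  # True
print(mutually_orthogonal(damaged))      # False
```

Dropping a single treatment combination is enough to make, for instance, the temperature and variety comparisons no longer independent of one another.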


TABLE 19

Analysis of Variance of Apple Callus Growth

Item                  Sum of Squares    N    Mean Square      t      Probability

Temperatures (T)           20.25        1       20.25       1.436     0.2-0.1
Variety (V)                36.00        1       36.00       1.914     0.1-0.05
Time (D)                   49.00        1       49.00       2.234     0.1-0.05
TV interaction             16.00        1       16.00       1.276     0.3-0.2
TD      ,,                  9.00        1        9.00
VD      ,,                  6.25        1        6.25
TVD     ,,                  2.25        1        2.25
Series                     56.25        1       56.25
Error                      68.75        7        9.8214

Total                     263.75       15

The analysis of variance is given in Table 19. The three
main effects and the four interactions of temperature, variety
and time are found by the methods of Chapter VI. The correction
term is $102^2/16$, i.e. 650.25. The temperature sum of squares is
$\tfrac{1}{8}(42^2+60^2)-650.25$, that for varieties
$\tfrac{1}{8}(63^2+39^2)-650.25$, and that for times
$\tfrac{1}{8}(37^2+65^2)-650.25$. In each case the divisor is 8,
because each number which is squared is the sum of eight single
observations. These sums of squares could equally have been
found by direct utilization of the comparison formulae (Section 25
and Table 11) as

$\tfrac{1}{16}(60-42)^2$, $\tfrac{1}{16}(63-39)^2$ and $\tfrac{1}{16}(65-37)^2$ respectively.

The divisor is now 16 as this number of observations is involved,
once each, in every quantity to be squared, or in other words,
$S(k^2)=16$.

Two-way tables give the material for finding the first-order
interactions. That for temperature and variety contains four
values 23, 19, 40, and 20, so the total sum of squares for three
degrees of freedom is $\tfrac{1}{4}(23^2+19^2+40^2+20^2)-650.25$, from
which are deducted the main temperature and variety items to give
16.00 for the interaction degree of freedom. The second-order
interaction is found similarly from the three-way table, by deduction
of the three main and three first-order interaction effects.

There remain 8 degrees of freedom for which sums of squares
are to be found. The total of the whole 15 can be obtained by
squaring the 16 single observations and deducting the correction
term. It is 263.75. The 7 degrees of freedom already accounted
for take a sum of squares of 138.75 between them, leaving 125.00
for the 8 degrees of freedom under discussion. The totals of the
two series are 66 and 36 respectively, so that an item corresponding
to 1 degree of freedom can be found and isolated from
the total of 8 degrees of freedom just calculated. This sum of
squares is $\tfrac{1}{16}(66-36)^2$, i.e. 56.25, leaving $125.00-56.25$
or 68.75 for the 7 error degrees of freedom. The analysis is now
complete (Table 19).
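The arithmetic of this analysis can be retraced in a few lines. The following is a minimal sketch, not the author's own procedure: it reads the 16 observations from Table 18, taking the Series 2, 6-day, 20° C. stock entry as 2 (the value implied by the printed marginal totals, an assumption of this sketch), and forms each sum of squares from the appropriate totals less the correction term. The names `ss_between`, `ROWS` and `COLUMNS` are illustrative.

```python
# Table 18 observations: rows are (series, day); columns are
# (temperature, variety). The Series 2, 6-day, 20 C stock entry is taken
# as 2, the value implied by the printed marginal totals (an assumption).
COLUMNS = [(20, "scion"), (20, "stock"), (32, "scion"), (32, "stock")]
ROWS = {
    (1, 6): [3, 3, 9, 7],
    (1, 7): [9, 9, 19, 7],
    (2, 6): [3, 2, 7, 3],
    (2, 7): [8, 5, 5, 3],
}

# One record per observation, keyed by (series, day, temperature, variety).
obs = {
    (series, day, temp, variety): y
    for (series, day), row in ROWS.items()
    for (temp, variety), y in zip(COLUMNS, row)
}
SERIES, DAY, TEMP, VARIETY = range(4)

n = len(obs)
correction = sum(obs.values()) ** 2 / n  # 102^2 / 16


def ss_between(*positions):
    """Sum of squares between the totals formed by grouping on the given key positions."""
    totals = {}
    for key, y in obs.items():
        group = tuple(key[p] for p in positions)
        totals[group] = totals.get(group, 0) + y
    divisor = n // len(totals)  # number of observations summed into each total
    return sum(t * t for t in totals.values()) / divisor - correction


ss = {
    "T": ss_between(TEMP),
    "V": ss_between(VARIETY),
    "D": ss_between(DAY),
}
ss["TV"] = ss_between(TEMP, VARIETY) - ss["T"] - ss["V"]
ss["TD"] = ss_between(TEMP, DAY) - ss["T"] - ss["D"]
ss["VD"] = ss_between(VARIETY, DAY) - ss["V"] - ss["D"]
ss["TVD"] = ss_between(TEMP, VARIETY, DAY) - sum(ss.values())
ss["Series"] = ss_between(SERIES)
total = sum(y * y for y in obs.values()) - correction
ss["Error"] = total - sum(ss.values())

for item, value in ss.items():
    print(f"{item:7s}{value:8.2f}")
print(f"{'Total':7s}{total:8.2f}")
```

The error item is obtained by difference, exactly as in the text, so it carries the 7 degrees of freedom left after the treatment and series items have been removed.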

After the mean squares have been found a t can be calculated
to test the significance of any of the 8 individual items which
have been separated. All the three main effects have rather
large mean squares, though not one of them is significant at the
5% level. With the exception of one, the interactions are sub-normal
in size. The experiment is thus inconclusive, but
suggests that either a larger experiment, or one in which the
error variation was more carefully controlled by the removal of
further factors, in the way the series difference was eliminated,
would give significant results for the three main factors, though
probably not for their interactions.
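As a check on the tests of significance, each t may be computed as the square root of the ratio of an item's mean square to the error mean square. The sketch below assumes the mean squares of Table 19 and uses the familiar tabulated two-sided 5% point of t for 7 degrees of freedom, 2.365; the variable names are illustrative.

```python
import math

ERROR_MS = 68.75 / 7  # error mean square on 7 degrees of freedom
T_5PCT_7DF = 2.365    # tabulated two-sided 5% point of t for 7 degrees of freedom

# Mean squares of the single-degree-of-freedom treatment items (Table 19).
mean_squares = {
    "T": 20.25, "V": 36.00, "D": 49.00,
    "TV": 16.00, "TD": 9.00, "VD": 6.25, "TVD": 2.25,
}

t = {item: math.sqrt(ms / ERROR_MS) for item, ms in mean_squares.items()}
significant = [item for item, value in t.items() if value > T_5PCT_7DF]

for item, value in t.items():
    print(f"{item:4s} t = {value:.3f}")
print("significant at the 5% level:", significant)
```

None of the items reaches the 5% point, in agreement with the conclusion that the experiment is inconclusive.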

Three main points stand out clearly from the analysis. In
the first place the elimination of the series difference contributes
materially to increasing the precision of the experiment. This
elimination is made possible by careful design. A faulty design
would not permit the isolation of the main series difference, which
would then be included in the error variance, with the result
that the latter would be almost twice the value actually found.
In that case the experiment would be of considerably less value.

Secondly, though temperature and time effects are
observed and trend in the direction which would be expected
a priori, they are not significant when compared with the error
variance. The use of a proper experimental technique has
allowed of their being adequately tested and has introduced the
proper caution into the interpretation of the results.

Lastly, the results show quite clearly that other unisolated
error factors are having a marked influence on callus formation.
Hence it may be concluded that a full analysis of this process
will require a considerable amount of further exploratory research.
The two series differed so profoundly that a study of the conditions
which varied between them might lead to some enlightenment
on this point.

These implications are only realizable because of the adoption 
of the factorial design, and of the care taken in providing a valid 
estimate of error variation. This question of the control and 
estimation of error must be given a more detailed consideration. 


29. THE CONTROL OF ERROR 

When considering the handling of undesirable variation it is
inevitable that the discussion will centre on agricultural field
trials, since modern experimental technique was initiated and
has reached its greatest elaboration in this realm. It is to field
trial designs that we must now turn to see how error can be
controlled.

It has been shown that a special item of uncontrollable
variation can be estimated and removed from the error term.
Two questions next arise: (a) Can more than one item be so
removed? and (b) What precautions must be taken to prevent
this process from invalidating the estimate of error? Clearly
the first question turns largely on the second one.

In order to reach an answer to this second question it is
necessary to recall the principles of a test of significance. The
deviation of the observation to be tested is compared with the
estimated variance, or with the standard deviation, of a frequency
distribution representing the effects of the variables
which, though present, are not being controlled in the experiment.
In order to estimate this error variance a number of
observations are made of individuals from the distribution, and
these individuals must be taken at random from the population
of which they form part. If this is not done the estimate of
variance will be biased in one direction or another and so will
lead to spuriously large or small probabilities when used in the
test of significance, for the whole analysis is, as we have seen,
based on the assumption that sampling is a random process.

Now suppose that one of the several variables determining
the magnitude of the individuals in the population is recognizable.
The population can then be divided into two parts containing
respectively the individuals showing the action of one
phase of this variable and those showing the action of the other
phase. Provided that these two sub-populations are then
sampled at random the variance of each will be estimated with¬ 
out bias, or, as is more commonly done, the sum of squares of 
deviations from the means of both sub-populations will, on 
pooling, give an unbiased estimate of the variance of the two 
distributions taken together. So the prerequisite for the suc¬ 
cessful isolation of one cause of variation, while retaining the 
possibility of estimating the residual error, is that the various 
observations shall be made at random within the two groups 
separated by the variable to be isolated. It is then clearly pos¬ 
sible to go further and say that any number of error variables 
may be isolated without affecting the validity of the estimate 
of error, provided always that the observations are made at 
random within the sub-populations distinguished by these means. 
This will also be true if the means of separating the sub-popu¬ 
lations is a controlled treatment rather than an observable, but 
uncontrollable, variable. That which holds for one type of
separation also holds for the other.

The necessary and sufficient condition for the correct estima¬ 
tion of error is the random sampling within such groups as may 
be distinguished in the experiment. Now the way that the con¬ 
trolled treatments are applied to the material of the experiment 
may have a considerable effect on this sampling. Suppose that, 
in a replicated experiment, we are applying two alternative 
treatments A and O, having a difference in effect of a, to two
different individuals who differ to the extent x as a result of
uncontrolled variation. If A is always applied to the individual
which is larger as a result of the effect x the observable difference
between the two individuals will be a+x in all replications.
If A is applied to the lesser individual the difference will always
be a-x. If A and O are assigned to the individuals at random
the estimate of a will be unaffected by x and the estimate of
x be unaffected by a. Furthermore, no systematic allocation
of A or O can be used, because although the two individuals
are here formally distinguished into greater and lesser, in practice
it is impossible to predict which of the two will be the larger
and which the smaller. Furthermore, they will not always differ
by the exact amount x in all replications. Random allocation
of the treatments is the only way of overcoming these difficulties.
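The effect of the allocation rule on the estimate of a can be illustrated by a small simulation. This is a sketch under simplifying assumptions not made in the text: the uncontrolled difference is held at exactly x in every replication, the values of a and x are invented for illustration, and the function name is hypothetical.

```python
import random


def mean_observed_difference(a, x, rule, replications=10_000, seed=1):
    """Average observed (A minus O) difference over many replications.

    In every replication two individuals differ by x through uncontrolled
    causes, and treatment A adds a to whichever individual receives it.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(replications):
        if rule == "A to larger":
            a_on_larger = True
        elif rule == "A to smaller":
            a_on_larger = False
        else:  # "random": a fair coin decides which individual receives A
            a_on_larger = rng.random() < 0.5
        total += (a + x) if a_on_larger else (a - x)
    return total / replications


a, x = 2.0, 5.0  # illustrative values only
for rule in ("A to larger", "A to smaller", "random"):
    print(rule, round(mean_observed_difference(a, x, rule), 2))
```

The two systematic rules return a+x and a-x whatever the number of replications, whereas under random allocation the contributions of x tend to cancel and the average approaches a.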
A little reflection will show that the absolute magnitudes of
the individuals in each replication need not be fixed. The two
quantities a and x are estimated from differences between
individuals of the same replication. This method of estimation
is always successful provided that the treatments are represented
equally in the replications, i.e. provided that the design is
factorial, and that the treatments are allocated at random within
the replication. This rule provides the key to the successful
design and conduct of all experiments. Furthermore, no matter
what restraints are applied in arranging the replications, a valid 
estimate of error is always obtainable if the different restraints 
affect both one another and the treatments equally, i.e. if they 
are orthogonal both to one another and to the treatments. 

The application of this principle is well shown by the two 
simplest designs used in agricultural trials. These are known 
respectively as the Randomized Block and Latin Square arrange¬ 
ments. In a simple randomized block trial each block is a full 
replication containing one plot of each treatment or treatment 
combination, these being assigned at random to the various plots 
in the block. The apple callus experiment of Example 9 is an 
experiment of this kind. Each of the two series was a ‘block’
containing all the treatment combinations. If the experiment 
had consisted of two series each containing single individuals of 
eight varieties, as in a variety trial, it would still have been 
a randomized block arrangement, but the degrees of freedom 
for treatment comparisons would have had different meanings. 
In other words, the operation of the restraint introduced by 
arrangement into two equal series is unaffected by the inter¬ 
relations of the treatments themselves. The estimate of error 
is the same in all cases, being dependent on the variation of the 
different treatment comparisons from block to block. 

The Latin square is more instructive to consider in detail. 
It contains a double restraint in layout, as compared with the 
single restraint of the randomized block. The latter merely 
demands that a given treatment combination shall occur once 
in each block and, as a result, one item for block differences is 
removed from the error in the analysis of variance. The Latin 
square has plots arranged in n rows each of n plots. A square 
of 9 plots would be arranged as 3 rows of 3 plots thus:

□ □ □
□ □ □
□ □ □

It will be seen that this arrangement can also be regarded equally 
well as 3 columns each of 3 plots. If a row is considered as 
a block the whole experiment can be designed as 3 blocks each 
with 3 treatment combinations, but exactly the same result can 
be reached if the columns are considered as blocks. So if the 
three treatments are made to occur once in each row and once 
in each column, a double restraint is brought into operation 
and two items can be isolated from the error component in the 
analysis of variance, one for row differences and the other for 
column differences. 





A sample 3x3 square is 

A B C
C A B
B C A

where A, B and C represent the three treatment comparisons. 
This design conforms to the necessary conditions since groups 
are separable on the basis of either columns or rows, the remain¬ 
ing restraint and also the treatments, indicated by letters, being 
equally represented in all of them. Rows, columns and treat¬ 
ments are orthogonal to one another. If the layout were 

A B C A 
CAB 
B C 

the restraints and the treatments would not be orthogonal and
the design would be unsound.
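Whether a proposed layout satisfies the double restraint is easily verified mechanically. The sketch below tests that every letter occurs once in each row and once in each column; the function name is illustrative, and the faulty layout shown is a hypothetical example constructed for the purpose, not a transcription of the one printed above.

```python
def is_latin_square(rows):
    """True if every symbol occurs exactly once in each row and each column."""
    n = len(rows)
    symbols = set(rows[0])
    if len(symbols) != n or any(len(row) != n for row in rows):
        return False
    rows_ok = all(set(row) == symbols for row in rows)
    columns_ok = all(set(column) == symbols for column in zip(*rows))
    return rows_ok and columns_ok


sound = ["ABC", "CAB", "BCA"]    # the sample square given in the text
faulty = ["ABC", "CAB", "ABC"]   # hypothetical layout: rows balanced, columns not

print(is_latin_square(sound))   # True
print(is_latin_square(faulty))  # False
```

In the faulty layout each row is still a complete replication, so it would serve as a randomized block arrangement, but the column restraint is no longer orthogonal to the treatments.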

One further point must be taken into account in laying out 
a Latin square. Within the limits of the restraints the assign¬ 
ments of treatments to plots must be at random. There are 
12 possible ways of laying out a 3x3 square, as shown in 
Table 20, and the random allocation of treatments to plots 
means the random selection of one of these twelve designs. In 
this case 12 squares are all obtainable from one another by 
rearrangement of the order of rows and columns; but this 
cannot be done with all sizes of square. A 4x4 square may be 
of any of 4 standard types, which cannot be converted into one 
another by reshuffling rows and columns. With higher squares 
still more standard types are obtained. 


TABLE 20

The Twelve 3x3 Latin Squares

A B C     A C B     B C A     B A C
B C A     B A C     C A B     C B A
C A B     C B A     A B C     A C B

C B A     C A B     A B C     A C B
A C B     A B C     C A B     C B A
B A C     B C A     B C A     B A C

B C A     B A C     C B A     C A B
A B C     A C B     B A C     B C A
C A B     C B A     A C B     A B C
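The count of twelve 3x3 squares, and the statement that they are all obtainable from one another by rearranging rows and columns, can be confirmed by direct enumeration. The following is a brute-force sketch, practicable only for small squares; the function names are illustrative.

```python
from itertools import permutations, product


def latin_squares(symbols):
    """Every Latin square on the given symbols, found by brute force."""
    n = len(symbols)
    rows = list(permutations(symbols))
    return [
        square
        for square in product(rows, repeat=n)
        if all(len(set(column)) == n for column in zip(*square))
    ]


def rearrangements(square):
    """All squares reachable by reordering the rows and columns of one square."""
    n = len(square)
    return {
        tuple(tuple(square[r][c] for c in col_order) for r in row_order)
        for row_order in permutations(range(n))
        for col_order in permutations(range(n))
    }


three = latin_squares("ABC")
print(len(three))                              # 12
print(rearrangements(three[0]) == set(three))  # True
print(len(latin_squares("ABCD")))              # 576
```

The 576 squares of side 4 correspond to the 4 standard types, each of which yields 4! x 3! = 144 squares when the columns and all rows but the first are permuted.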


In planning experiments where the Latin square design is 
used it may be very tedious making up a square for the purpose. 
Certainly the process is rendered much more rapid if the various 
types of square of a given size are catalogued. This cataloguing
has now been done up to and including the 7x7 square, of which
there are 61,428,210,278,400 types. Whether any higher ones 
will be catalogued is, however, doubtful, as the labour required 
will be very great even for the 8x8 size. Many sample squares 
of high order are given in Fisher and Yates’s tables. 

The analysis of the results obtained from a Latin square 
experiment presents no great difficulty, as the following example 
taken from the Rothamsted Report of 1931 will show. 

Example 10. It was desired to determine whether sulphate 
of potash and superphosphate would affect the yield of ‘ Arran 
Chief’ potatoes grown at a place in the Eastern Counties. 
Furthermore, the phosphatic manure was to be tested at three 
levels in order to obtain some idea of the most economic dressing 
to use. 

Superphosphate was applied at the levels of 0 cwt. (P0),
5 cwts. (P1) and 10 cwts. (P2) per acre, and the sulphate of
potash at the two levels of 0 cwt. (K0) and 2 cwts. (K1) per acre.
There are six possible combinations of these treatments, and so 
a 6x6 Latin square was used. In addition, certain plots received 
a nitrogenous dressing, but this will be disregarded as it appar¬ 
ently had no effect and complicates both design and analysis. 
Each of the plots was ^ acre in size. The detailed arrangement 
was as shown in Table 21, which also gives the yield, in lbs., 
of the various plots. The different treatment combinations 
were, of course, assigned at random within the restraints of the
experiment. The margins show the summed yields of rows and 
columns, while the summed yields of the different treatment 
combinations are given in the subsidiary table at the bottom. 

The analysis of variance is given in Table 22. The correction
term is $8{,}729^2/36$, i.e. 2,116,540.0278, as the total yield of
36 plots is 8,729 lbs. The total sum of squares for all 36 plots
is found to be 2,184,663-2,116,540.0278, i.e. 68,122.9722.

The six row totals are each based on 6 plots and so the sum of
squares for row differences becomes

$\tfrac{1}{6}(1{,}430^2+1{,}411^2+1{,}486^2+1{,}356^2+1{,}558^2+1{,}488^2)-2{,}116{,}540.0278=4{,}170.1389$

Similarly the column sum of squares is 40,481.1389. Lastly, the
sum of squares for treatment differences is calculated from the
six entries in the subsidiary table. Each of these is the sum of
6 observations made on plots receiving treatments as indicated
by the headings in the left margin and top row. The calculation
is

$\tfrac{1}{6}(1{,}286^2+1{,}317^2+1{,}419^2+1{,}540^2+1{,}511^2+1{,}656^2)-2{,}116{,}540.0278=16{,}610.4722$





TABLE 21

The Yield of ‘Arran Chief’ Potatoes in a Fertilizer Trial
(Rothamsted Report)

The fertilizer treatment is given above and the yield in lbs. below in each
plot of the 6x6 Latin Square

                                                          Row totals

  K1P0     K0P2     K0P0     K0P1     K1P1     K1P2
   186      187      208      222      296      331          1,430

  K1P1     K0P0     K1P2     K0P2     K0P1     K1P0
   213      134      296      265      250      253          1,411

  K0P1     K1P0     K0P2     K1P2     K0P0     K1P1
   198      155      272      290      261      310          1,486

  K1P2     K1P1     K0P1     K1P0     K0P2     K0P0
   233      184      218      234      248      239          1,356

  K0P2     K1P2     K1P1     K0P0     K1P0     K0P1
   245      233      282      248      247      303          1,558

  K0P0     K0P1     K1P0     K1P1     K1P2     K0P2
   196      228      242      255      273      294          1,488

Column totals                                             Grand total
 1,271    1,121    1,518    1,514    1,575    1,730          8,729


Treatment Totals

             P0       P1       P2     P0+P1+P2    P2-P0    P2-2P1+P0

K0         1,286    1,419    1,511      4,216       225        -41
K1         1,317    1,540    1,656      4,513       339       -107
K1+K0      2,603    2,959    3,167      8,729       564       -148
K1-K0         31      121      145        297       114        -66


TABLE 22

Analysis of Variance of the Potato Yields

Item                  Sum of Squares    N    Mean Square    Variance      t      Probability
                                                              Ratio

Rows                    4,170.1389      5       834.0278      2.431               0.20-0.05
Columns                40,481.1389      5     8,096.2278     23.600               very small
K                       2,450.2500      1     2,450.2500                2.673     0.02-0.01
P1                     13,254.0000      1    13,254.0000                6.215     very small
P2                        304.2222      1       304.2222                0.942     0.40-0.30
KP1                       541.5000      1       541.5000                1.256     0.30-0.20
KP2                        60.5000      1        60.5000                0.420     0.70-0.60
[Treatments (total)    16,610.4722      5     3,322.0944      9.684               very small]
Error                   6,861.2222     20       343.0611

Total                  68,122.9722     35




The three items, rows, columns and treatments, account for
5 degrees of freedom each, leaving, out of the total of 35,
20 degrees of freedom for the estimation of error. The error
sum of squares is found by subtracting the row, column and
treatment sums of squares from the total, to give

$68{,}122.9722-(4{,}170.1389+40{,}481.1389+16{,}610.4722)=6{,}861.2222$

This completes the first analysis and it will be seen that after 
finding the mean squares and, from them, the variance ratios, 
the column and treatment items are significant. The row item
is just not significant at the 5% level. In spite of this, however,
the double restraint was most probably justified as the
row mean square is fairly large and, in any case, a randomized 
block experiment, unless very fortunately arranged, would not 
have removed the full soil effect shown by the columns, with 
a consequent loss in efficiency. 
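The first analysis of the Latin square can be reproduced from the plot yields. In the sketch below the layout is the arrangement of Table 21 with the treatment labels read so that every combination occurs once in each row and column and the printed treatment totals are reproduced; where the print is unclear this reading is an assumption. The variable and function names are illustrative.

```python
# Treatment labels and yields (lbs.) of the 6x6 square, row by row.
# Labels are an assumed reading, checked against the printed treatment totals.
layout = [
    ["K1P0", "K0P2", "K0P0", "K0P1", "K1P1", "K1P2"],
    ["K1P1", "K0P0", "K1P2", "K0P2", "K0P1", "K1P0"],
    ["K0P1", "K1P0", "K0P2", "K1P2", "K0P0", "K1P1"],
    ["K1P2", "K1P1", "K0P1", "K1P0", "K0P2", "K0P0"],
    ["K0P2", "K1P2", "K1P1", "K0P0", "K1P0", "K0P1"],
    ["K0P0", "K0P1", "K1P0", "K1P1", "K1P2", "K0P2"],
]
yields = [
    [186, 187, 208, 222, 296, 331],
    [213, 134, 296, 265, 250, 253],
    [198, 155, 272, 290, 261, 310],
    [233, 184, 218, 234, 248, 239],
    [245, 233, 282, 248, 247, 303],
    [196, 228, 242, 255, 273, 294],
]

n = len(yields)
grand_total = sum(sum(row) for row in yields)
correction = grand_total ** 2 / (n * n)


def ss_between(totals):
    """Sum of squares between totals that are each the sum of n plots."""
    return sum(t * t for t in totals) / n - correction


row_totals = [sum(row) for row in yields]
column_totals = [sum(column) for column in zip(*yields)]
treatment_totals = {}
for label_row, yield_row in zip(layout, yields):
    for label, y in zip(label_row, yield_row):
        treatment_totals[label] = treatment_totals.get(label, 0) + y

ss_total = sum(y * y for row in yields for y in row) - correction
ss_rows = ss_between(row_totals)
ss_columns = ss_between(column_totals)
ss_treatments = ss_between(treatment_totals.values())
ss_error = ss_total - ss_rows - ss_columns - ss_treatments

df_item = n - 1                # rows, columns and treatments alike
df_error = (n - 1) * (n - 2)   # 35 - 3 x 5 = 20
ms_error = ss_error / df_error

for name, ss in (("Rows", ss_rows), ("Columns", ss_columns), ("Treatments", ss_treatments)):
    print(f"{name:11s}{ss:12.4f}  variance ratio {ss / df_item / ms_error:7.3f}")
print(f"{'Error':11s}{ss_error:12.4f}  mean square {ms_error:9.4f}")
```

The error sum of squares is again found by difference, so that any slip in a row, column or treatment total would show itself as a disagreement with the 6,861.2222 of Table 22.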

If the six treatment combinations had been unrelated, or 
dependent on varietal differences, no further analysis would be 
possible; but in the present case a more searching inquiry into 
the effects of the different fertilizers can be made. Full analysis
is possible as the design incorporates all possible combinations
of K0, K1 and P0, P1 and P2 equally.

There are 5 degrees of freedom for differences between the
six combinations. One of these is ascribable to the main effect
of sulphate of potash and will be calculated from the (K1-K0)
differences. Three such differences exist, distinguished by the
presence of P2, P1 or P0 in the plots from which they are found.
The whole comparison may be represented as:

$(K_1P_2-K_0P_2+K_1P_1-K_0P_1+K_1P_0-K_0P_0)$

The six k coefficients are all +1 or -1 in this expression, but as
each item is the sum of six observations the divisor must be
6(1+1+1+1+1+1) if the sum of squares is to be placed on the basis
of single plots. This may be expressed somewhat differently by
saying that the divisor is 36 as the number to be squared is
compounded of 36 observations used once each. Arithmetically this
sum of squares is found as

$\tfrac{1}{36}(1{,}317-1{,}286+1{,}540-1{,}419+1{,}656-1{,}511)^2$, i.e. 2,450.2500

This could, of course, equally have been found as

$\tfrac{1}{18}(4{,}216^2+4{,}513^2)-\dfrac{8{,}729^2}{36}$

There are three phosphate treatments and so there will be
2 degrees of freedom for differences between them. The total
sum of squares corresponding to these two degrees of freedom
can be obtained from the P2, P1 and P0 totals summed over
K1 and K0 as shown in the third line of the subsidiary Table 21.
This gives

$\tfrac{1}{12}(2{,}603^2+2{,}959^2+3{,}167^2)-\dfrac{8{,}729^2}{36}$ or 13,558.2222

There still remain 2 degrees of freedom for the interaction of
K and P. Subtraction of the main K and P items from the total
sum of squares for treatments leaves 602.0000 for this interaction
sum of squares. It could also have been found from the
(K1-K0) values in the bottom line of the subsidiary table, by
taking

$\tfrac{1}{12}(31^2+121^2+145^2)-\dfrac{297^2}{36}$

This calculation is strictly analogous to that of the main P sum
of squares from the (K1+K0) sums of the line above.

The analytical possibilities do not stop here. The 2 degrees
of freedom for the main effect of P and the 2 degrees of freedom
for the KP interaction may be separated into their components.
We might expect that if P has any effect at all the treatment
P2 will have twice the effect of the treatment P1. This suggests
the subdivision of the P effect into single items based on the
comparisons (P2-P0) and (P2-2P1+P0) respectively. The first
one tests whether phosphates are effective manures while the
second tests the linearity of the response, if any. The sums of
squares are found from the formulae

$\tfrac{1}{24}(K_1P_2+K_0P_2-K_1P_0-K_0P_0)^2$

$\tfrac{1}{72}(K_1P_2+K_0P_2-2K_1P_1-2K_0P_1+K_1P_0+K_0P_0)^2$

These are easily shown to be independent comparisons. Indeed,
the only point about them requiring comment is the determination
of the divisors. The first one presents little difficulty, as
the number to be squared is based on 24 observations each used
once. The second is a little more troublesome because 12 plots
are used twice each in addition to the 24 used once. Section 25
shows that the divisor must be $S(k^2)$. In this case k = 1 for
24 entries and k = -2 for the remaining twelve, and so $S(k^2)$ is

(24x1)+(12x4)=72

The two sums of squares are then

$\tfrac{1}{24}(1{,}656+1{,}511-1{,}317-1{,}286)^2=13{,}254.0000$

and $\tfrac{1}{72}(1{,}656+1{,}511-2\times1{,}540-2\times1{,}419+1{,}317+1{,}286)^2=304.2222$

which together equal the joint sum of squares found above.

The analysis of the KP interaction is conducted similarly
except that it is based on the (K1-K0) differences instead of
the (K1+K0) sums used in the calculation of the main effect
of P. The divisors are just as before, so that the two sums of
squares are:

$\tfrac{1}{24}(145-31)^2=541.5000$

and $\tfrac{1}{72}(145-2\times121+31)^2=60.5000$

The analysis of variance has now been taken to its limit.
The significance of the five separate treatment effects may be
determined by the use of t because each is based on one degree
of freedom. Each t will have, of course, 20 degrees of freedom
as this number of comparisons was used in the estimation of
error. The main effect of potash, marked K in the analysis,
and also the item testing the effect of phosphate, marked P1,
are highly significant. The potatoes respond to both manures.
The item P2 testing for departure from a linear response to
phosphate is not significant. So it would appear that a double
dose of phosphate has twice the effect of a single dose. Neither
of the potash-phosphate interactions is significant and so there
is no evidence that the response to one manure is affected by the
presence or absence of the other.
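The five single comparisons and their divisors can be generated from one rule: the coefficient of each treatment total is the product of a potash coefficient and a phosphate coefficient, and the divisor is the number of plots in each total multiplied by S(k²). The sketch below applies this rule to the treatment totals of Table 21; the coefficient tables and the function name are illustrative, and the error mean square is taken from Table 22.

```python
import math

# Treatment totals from the subsidiary table of Table 21; each is the sum of 6 plots.
totals = {
    ("K0", "P0"): 1286, ("K0", "P1"): 1419, ("K0", "P2"): 1511,
    ("K1", "P0"): 1317, ("K1", "P1"): 1540, ("K1", "P2"): 1656,
}
PLOTS_PER_TOTAL = 6
ERROR_MS = 6861.2222 / 20  # error mean square on 20 degrees of freedom

K_SUM = {"K0": 1, "K1": 1}
K_DIFF = {"K0": -1, "K1": 1}                 # K1 - K0
P_SUM = {"P0": 1, "P1": 1, "P2": 1}
P_LINEAR = {"P0": -1, "P1": 0, "P2": 1}      # P2 - P0
P_CURVATURE = {"P0": 1, "P1": -2, "P2": 1}   # P2 - 2P1 + P0


def comparison_ss(k_part, p_part):
    """Sum of squares for one degree of freedom; divisor = plots per total x S(k^2)."""
    coef = {(k, p): k_part[k] * p_part[p] for (k, p) in totals}
    value = sum(coef[cell] * totals[cell] for cell in totals)
    divisor = PLOTS_PER_TOTAL * sum(c * c for c in coef.values())
    return value ** 2 / divisor


ss = {
    "K": comparison_ss(K_DIFF, P_SUM),
    "P1": comparison_ss(K_SUM, P_LINEAR),
    "P2": comparison_ss(K_SUM, P_CURVATURE),
    "KP1": comparison_ss(K_DIFF, P_LINEAR),
    "KP2": comparison_ss(K_DIFF, P_CURVATURE),
}
t = {item: math.sqrt(value / ERROR_MS) for item, value in ss.items()}

for item in ss:
    print(f"{item:4s}{ss[item]:12.4f}   t = {t[item]:.3f}")
print(f"sum {sum(ss.values()):12.4f}")
```

The five items add up to the treatment sum of squares for 5 degrees of freedom, which is the arithmetical expression of their mutual independence.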

The main points to be observed about the design of this 
experiment are (a) the retention of the factorial design, which 
allows of fine analysis of the treatment effects, while a double 
restraint is applied to reduce the error variance, and (b) the 
method of obtaining a valid estimate of error when a doubly 
restrained design is used. 

Squares afford the possibility of applying more than two
restraints to the design. In a 3x3 square, for example, we can
assign three Greek letters in such a way that each occurs once
in every row, once in every column and once with every Latin
letter, thus:

Aγ  Bβ  Cα
Cβ  Aα  Bγ
Bα  Cγ  Aβ

Not all Latin squares are capable of being turned into Graeco-
Latin squares of this type. Of the four standard types of
4x4 squares only one gives a Graeco-Latin design, viz.

Aα  Bβ  Cγ  Dδ          Aα  Bβ  Cγ  Dδ
Bγ  Aδ  Dα  Cβ          Bδ  Aγ  Dβ  Cα
Cδ  Dγ  Aβ  Bα          Cβ  Dα  Aδ  Bγ
Dβ  Cα  Bδ  Aγ          Dγ  Cδ  Bα  Aβ

These two Graeco-Latin squares cannot be mutually transformed
by reshuffling rows and columns or even letters, and so they are
distinct standard types. A single type of Latin square may give
several types of Graeco-Latin square in this way, and the number
of possibilities with higher-order squares becomes very large.
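The three conditions on a Graeco-Latin square (each Greek letter once in every row, once in every column and once with every Latin letter) can be verified as follows. This is a sketch only: the squares are written out as they are read here, with the Latin and Greek letters held in separate arrays, and the function names are illustrative.

```python
def is_latin(square):
    """True if every symbol occurs once in each row and each column."""
    n = len(square)
    symbols = set(square[0])
    return (
        len(symbols) == n
        and all(len(row) == n and set(row) == symbols for row in square)
        and all(set(column) == symbols for column in zip(*square))
    )


def is_graeco_latin(latin, greek):
    """True if both squares are Latin and every Latin-Greek pair occurs exactly once."""
    n = len(latin)
    pairs = {(latin[i][j], greek[i][j]) for i in range(n) for j in range(n)}
    return is_latin(latin) and is_latin(greek) and len(pairs) == n * n


latin3 = ["ABC", "CAB", "BCA"]
greek3 = ["γβα", "βαγ", "αγβ"]

latin4 = ["ABCD", "BADC", "CDAB", "DCBA"]
greek4_first = ["αβγδ", "γδαβ", "δγβα", "βαδγ"]
greek4_second = ["αβγδ", "δγβα", "βαδγ", "γδαβ"]
greek4_bad = ["αβγδ", "βαδγ", "γδαβ", "δγβα"]  # Greek letters merely copy the Latin ones

print(is_graeco_latin(latin3, greek3))         # True
print(is_graeco_latin(latin4, greek4_first))   # True
print(is_graeco_latin(latin4, greek4_second))  # True
print(is_graeco_latin(latin4, greek4_bad))     # False
```

The last case shows that two Latin squares do not of themselves make a Graeco-Latin square: the third restraint is orthogonal to the treatments only if every pairing of letters occurs once.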




In practice, however, examples where three or more restraints 
have been used are seldom encountered. Perhaps this is because 
such designs have not been used widely outside agricultural work, 
where a third restraint would do little to help the reduction of 
soil error variation. In other types of work it is likely that 
multiple restraints would be advantageous, though they should 
obviously be used with care sufficient to obviate any loss of
precision by the immoderate reduction of the number of error
degrees of freedom.

30. CONFOUNDING

Whatever restraints are applied with a view to reducing error
variation, the factorial design of the treatment combinations 
must be retained if the experiment is to be efficient. In all the 
cases considered so far this has been achieved by the equal use 
of all possible treatment combinations in every replication. 
Fisher has shown, however, that this is not necessary for main¬ 
taining the factorial nature of the design. Certain limitations 
of the treatment combinations, which allow the use of more 
stringent restraints to reduce error variation without affecting 
the calculation of the treatment effects, are possible. 

In order to see how this can happen, let us consider the case 
of two treatments, A and B, each used at two levels, A1 and A0,
and B1 and B0. A simple randomized block experiment would
have four plots to the block, one taking each of the four treat¬ 
ment combinations. There would then be 3 degrees of freedom 
for differences between treatments and single comparisons could 
be isolated for the main effects of A and B and for their first- 
order interaction. With n such blocks the analysis of variance 
would be: 

TABLE 23

Item                 N

Blocks              n-1
A main effect        1
B main effect        1
AB interaction       1
Error               3n-3

Total               4n-1


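The sums of squares for the three treatment items would be
found from the formulae

$\tfrac{1}{4n}(A_1B_1+A_1B_0-A_0B_1-A_0B_0)^2$ for the main effect of A,

$\tfrac{1}{4n}(A_1B_1-A_1B_0+A_0B_1-A_0B_0)^2$ for the main effect of B, and

$\tfrac{1}{4n}(A_1B_1-A_1B_0-A_0B_1+A_0B_0)^2$ for the interaction AB,

each symbol standing for the total yield of the n plots receiving
that treatment combination.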


The independence of the main effect of A and block differences
will be ensured by placing A1B1 in the same block as
A0B1, and A1B0 in the same block as A0B0, though these two
pairs, on the other hand, need not be in the same block. For
suppose that A1B1 and A0B1 are in a block with fertility x, while
A1B0 and A0B0 are in a second block with fertility y, the sum
of squares then becomes

$\tfrac{1}{4}[(A_1B_1+x)+(A_1B_0+y)-(A_0B_1+x)-(A_0B_0+y)]^2$

This reduces to the original form because the x and y items
cancel out. The same result could be achieved by having A1B1 and
A0B0 in block X and A1B0 and A0B1 in block Y. This is perhaps
less obviously true than is the case of the first arrangement, but
a little reflection will show that block differences cancel out as
before. The rule is apparently that the comparisons can be
made free of block differences provided that the two plots of a
block have opposite signs in the formula from which the sum of
squares is calculated.

The B comparison may also be found independently of block
differences if the treatments are paired in either of the two ways

           Block X               Block Y
(a)    A1B1 and A1B0         A0B1 and A0B0
(b)    A1B1 and A0B0         A0B1 and A1B0

Similarly, pairs which will give an unbiased estimate of the
interaction item are:

           Block X               Block Y
(a)    A1B1 and A1B0         A0B1 and A0B0
(b)    A1B1 and A0B1         A1B0 and A0B0

These various arrangements are collected together in Table 24,
together with the comparisons from which the three sums of
squares are calculated.

TABLE 24

Confounding Arrangements

Comparisons     Main effect A       A1B1 + A1B0 - A0B1 - A0B0
                Main effect B       A1B1 - A1B0 + A0B1 - A0B0
                Interaction AB      A1B1 - A1B0 - A0B1 + A0B0

  Pairs of plots in block                  Comparisons
      X               Y              Recoverable    Confounded

 A1B1  A1B0      A0B1  A0B0           B and AB          A
 A1B1  A0B1      A1B0  A0B0           A and AB          B
 A1B1  A0B0      A1B0  A0B1           A and B           AB

There are three arrangements of pairs in
two blocks and each arrangement permits the isolation of two
of the three treatment comparisons independently of block
differences. The third comparison is inextricably mixed up with
the differences in fertility between the two blocks and cannot be
recovered. It is said to be ‘confounded’ with block differences.
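The rule that a comparison is free of block differences only when the plots of each block carry opposite signs can be tried out numerically. The sketch below adds arbitrary fertilities x and y to the two blocks of each arrangement and reports which comparison is thereby altered; the fertility values and the function names are invented for illustration.

```python
# Sign of each treatment combination in the three comparisons of Table 24.
SIGNS = {
    "A":  {"A1B1": +1, "A1B0": +1, "A0B1": -1, "A0B0": -1},
    "B":  {"A1B1": +1, "A1B0": -1, "A0B1": +1, "A0B0": -1},
    "AB": {"A1B1": +1, "A1B0": -1, "A0B1": -1, "A0B0": +1},
}


def comparison(name, plot_yields):
    """Value of the named comparison for the given plot yields."""
    return sum(SIGNS[name][plot] * y for plot, y in plot_yields.items())


def confounded_items(block_x, block_y, x=10.0, y=-4.0):
    """Comparisons whose value changes when fertilities x and y are added to the blocks."""
    base = {"A1B1": 0.0, "A1B0": 0.0, "A0B1": 0.0, "A0B0": 0.0}  # no treatment effects
    with_blocks = dict(base)
    for plot in block_x:
        with_blocks[plot] += x
    for plot in block_y:
        with_blocks[plot] += y
    return [
        name for name in SIGNS
        if comparison(name, with_blocks) != comparison(name, base)
    ]


arrangements = [
    (("A1B1", "A1B0"), ("A0B1", "A0B0")),
    (("A1B1", "A0B1"), ("A1B0", "A0B0")),
    (("A1B1", "A0B0"), ("A1B0", "A0B1")),
]
for block_x, block_y in arrangements:
    print(block_x, block_y, "confounds", confounded_items(block_x, block_y))
```

Each arrangement sacrifices exactly one comparison, A, B and AB in turn, and leaves the other two untouched by the fertility difference.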

So by sacrificing one of the three treatment comparisons the
number of plots per block is halved and the number of blocks
can be doubled. This may be of great value where either the
number of 'plots' per 'block' is limited, as in twin research or
feeding experiments with animals, or where it is necessary to
keep the block size small in order to reap the full benefit which
local control gives in the reduction of error variation.

Confounding would probably not often be practised in such 
a simple case as that outlined, but the same principles apply to 
more complex experiments such as the one on asparagus described 
by Wishart. 

Example 11. Table 25 gives the yield in lbs. per plot of 
asparagus which had been subjected to all combinations of the 
three fertilizers, nitrogen (N), phosphate (P) and potash (K) each 
at two levels. Eight blocks each of four plots were used. Four 
of the blocks had one plot each with dressings of nitrogen 
alone (N), phosphate alone (P), potash alone (K) and all three 
together (NPK). The other four blocks each had one plot with 
no fertilizer (O), with nitrogen and phosphate together (NP), 
with nitrogen and potash together (NK), and with phosphate 
and potash together (PK). 

The design involves confounding, as each block carries but
four of the eight possible treatment combinations, though all
eight have been used equally when the whole experiment is taken
into account. Our first task, then, is to determine which
comparison or comparisons have been sacrificed to increase the local
control of error variation.

There are seven treatment comparisons whose sums of squares
are calculated from the formulae

Main effects
  N      $\tfrac{1}{32}(NPK-PK+NP-P+NK-K+N-O)^2$
  P      $\tfrac{1}{32}(NPK+PK+NP+P-NK-K-N-O)^2$
  K      $\tfrac{1}{32}(NPK+PK-NP-P+NK+K-N-O)^2$

First-order interactions
  NP     $\tfrac{1}{32}(NPK-PK+NP-P-NK+K-N+O)^2$
  NK     $\tfrac{1}{32}(NPK-PK-NP+P+NK-K-N+O)^2$
  PK     $\tfrac{1}{32}(NPK+PK-NP-P-NK-K+N+O)^2$

Second-order interaction
  NPK    $\tfrac{1}{32}(NPK-PK-NP+P-NK+K+N-O)^2$

The divisor is 32 as all comparisons involve, once each, the sums
of the four plots receiving each of the various treatments, there
being eight treatments in all.



TABLE 25

Yields of Asparagus in a Fertilizer Trial (Wishart)

The fertilizers are shown above and the yields in lbs. below

Block   Type of block   Plots                           Block total

1       X               NPK    K      P      N
                        12.0   16.2   14.6   12.7       55.5

2       Y               NK     NP     PK     O
                        10.2   12.8   13.8   13.9       50.7

3       Y               O      NP     NK     PK
                        14.7   9.3    8.8    9.0        41.8

4       X               N      P      K      NPK
                        10.8   8.9    8.3    10.3       38.3

5       Y               PK     O      NK     NP
                        13.0   12.7   11.3   10.3       47.3

6       X               K      P      N      NPK
                        13.3   15.2   12.1   11.5       52.1

7       X               NPK    K      P      N
                        11.4   10.4   11.7   9.3        42.8

8       Y               NP     O      NK     PK
                        9.1    10.5   8.2    13.5       41.3

Treatment Totals

Block X                                 Block Y                                 Grand
NPK    N      P      K      Group       NP     NK     PK     O      Group       total
                            total                                   total
45.2   44.9   50.4   48.2   188.7       41.5   38.5   49.3   51.8   181.1       369.8


A comparison of these formulae with the layout adopted
shows that in the first six formulae half the plots with a plus
sign occur in each of the two types of block, as do those with
a minus sign. Thus in the first one, the main N comparison, one
type of block contains NPK, N, P and K which have the signs
+, +, -, - respectively, while the other type of block contains
NP, NK, PK and O with signs +, +, -, -. So the first six
comparisons can be recovered free of block differences. The seventh
comparison is the one which is confounded. The types of plot
with a + sign all occur in one kind of block and those taking
a - sign in the comparison are found together in the other kind
of block. This comparison cannot be separated from the
difference between the two kinds of block.
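The inspection of signs just described can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the author's procedure: the helper names `sign`, `EFFECTS` and `confounded` are hypothetical, and the sign of a treatment in a comparison is taken as the product, over the factors named in the comparison, of +1 when the fertilizer is present and -1 when it is absent, which reproduces the seven formulae above.

```python
from itertools import combinations

FACTORS = "NPK"

# The seven comparisons: N, P, K, NP, NK, PK, NPK.
EFFECTS = ["".join(c) for r in (1, 2, 3) for c in combinations(FACTORS, r)]


def sign(effect, treatment):
    """Sign of a treatment total in a comparison.

    Product over the factors in `effect` of +1 if that fertilizer is
    present in `treatment`, -1 if absent ("O" carries no fertilizer).
    """
    s = 1
    for factor in effect:
        s *= 1 if factor in treatment else -1
    return s


def confounded(block_x, block_y):
    """Comparisons whose signs fail to balance within either type of block."""
    return [
        e
        for e in EFFECTS
        if sum(sign(e, t) for t in block_x) != 0
        or sum(sign(e, t) for t in block_y) != 0
    ]


block_x = ["NPK", "N", "P", "K"]
block_y = ["NP", "NK", "PK", "O"]
print(confounded(block_x, block_y))
```

Only the second-order interaction should be reported, the six other comparisons balancing within each type of block.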

Having decided which comparison is confounded, the analysis
of variance is easy. There are 31 degrees of freedom in all, of
which 7 are taken up by block differences, and 6 by unconfounded
treatment comparisons. This leaves 18 degrees of freedom for
error.

The correction term is $\frac{369.8^2}{32}$, i.e. 4,273.5013, and the total
sum of squares for 31 degrees of freedom is found from the
32 individual observations to be 145.4787. The blocks each
contain four plots and so, in the calculation of the block sum of
squares, the divisor is 4. This sum of squares is then found in
the usual way to be 65.0237. The treatment sum of squares
cannot, however, be found in quite the normal fashion, because
the 8 treatments yield only 6 degrees of freedom.

If the treatment totals are separated into two groups of four,
according to whether they are found from one type of block or
the other, there are 3 degrees of freedom within each group.
The seventh degree of freedom, that between the groups, is the
one which has been confounded and which has already been
determined as part of the block difference item. The sum of
squares corresponding to the 3 degrees of freedom for treatment
comparisons from the first type of block is found as

$$\tfrac{1}{4}(44.9^2+50.4^2+48.2^2+45.2^2)-\frac{188.7^2}{16}$$

the correction term being derived from the total yield of this
group of results. That for the other set of treatments is similarly
obtained as

$$\tfrac{1}{4}(51.8^2+41.5^2+38.5^2+49.3^2)-\frac{181.1^2}{16}$$

giving with the first group a joint sum of squares of 34.8638 for
treatments. The error sum of squares is obtained by subtraction
from the total sum of squares.
The resulting analysis of variance is set out in Table 26, from
which it will be seen that while the block item is highly significant
that for treatments could reasonably be ascribed to sampling
error.
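The arithmetic of this analysis can be retraced with a short Python sketch. It is offered only as a check under stated assumptions: the yields are those read from Table 25, block 5 is taken to be of type Y from the treatments it carries, and the variable names are illustrative.

```python
# Yields (lb per plot) as read from Table 25: (block type, {treatment: yield}).
# Block 5 is assumed to be of type Y, judging by its treatments.
blocks = [
    ("X", {"NPK": 12.0, "K": 16.2, "P": 14.6, "N": 12.7}),
    ("Y", {"NK": 10.2, "NP": 12.8, "PK": 13.8, "O": 13.9}),
    ("Y", {"O": 14.7, "NP": 9.3, "NK": 8.8, "PK": 9.0}),
    ("X", {"N": 10.8, "P": 8.9, "K": 8.3, "NPK": 10.3}),
    ("Y", {"PK": 13.0, "O": 12.7, "NK": 11.3, "NP": 10.3}),
    ("X", {"K": 13.3, "P": 15.2, "N": 12.1, "NPK": 11.5}),
    ("X", {"NPK": 11.4, "K": 10.4, "P": 11.7, "N": 9.3}),
    ("Y", {"NP": 9.1, "O": 10.5, "NK": 8.2, "PK": 13.5}),
]

yields = [y for _, plots in blocks for y in plots.values()]
n = len(yields)                                    # 32 plots
correction = sum(yields) ** 2 / n                  # correction term
total_ss = sum(y * y for y in yields) - correction

# Blocks: each block total is the sum of 4 plots, hence the divisor 4.
block_ss = sum(sum(p.values()) ** 2 for _, p in blocks) / 4 - correction

# Treatment totals, kept separately for the two types of block.
totals = {"X": {}, "Y": {}}
for kind, plots in blocks:
    for treatment, y in plots.items():
        totals[kind][treatment] = totals[kind].get(treatment, 0.0) + y

# 3 degrees of freedom within each group, corrected by the group's own total.
treatment_ss = 0.0
for group in totals.values():
    group_total = sum(group.values())              # 16 plots in each group
    treatment_ss += sum(t * t for t in group.values()) / 4 - group_total ** 2 / 16

error_ss = total_ss - block_ss - treatment_ss
print(round(correction, 4), round(total_ss, 4), round(block_ss, 4),
      round(treatment_ss, 4), round(error_ss, 4))
```

The difference between the two group totals never enters the treatment sum of squares, which is how the confounded comparison is left with the block item.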

TABLE 26

Analysis of Variance of Asparagus Yields

Item                  Sum of     N     Mean      Variance   t        Probability
                      Squares          Square    Ratio
Blocks                65.0237    7     9.2891    3.668               0.01
N                     27.3800    1     27.3800              3.289    0.01-0.001
P                     0.2813     1     0.2813               0.333    0.80-0.70
K                     1.7112     1     1.7112               0.822    0.50-0.40
NP                    0.4050     1     0.4050               0.400    0.70-0.60
NK                    0.1250     1     0.1250               0.222    0.90-0.80
PK                    4.9613     1     4.9613               1.400    0.20-0.10
Treatments (total)    34.8638    6     5.8106    2.294               0.20-0.05
Error                 45.5912    18    2.5328
Total                 145.4787   31

The next part of the analysis is the decomposition of the 
treatment comparisons, and the formulae used earlier for the 
detection of the confounded component can be used for this 
purpose. The individual items are then found to have the values 
shown in the analysis of variance. Comparison with the error 
item by means of t tests shows that the main effect of nitrogen is 
significant, but that all other items fail to show any real 
departure from expectation. Nitrogen is the only fertilizer 
which affects the yield. It might be noted that its effect on 
yield is curious in that the addition of nitrogen decreases, rather 
than increases, the amount of asparagus harvested. 

Taken together, the six treatment items have a barely
significant mean square, but when they are decomposed one of them,
the main effect of nitrogen, is found to be highly significant.
The nitrogen effect was masked by the other five components,
but decomposition brings out the relative importance of the
various items.
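The decomposition can be retraced from the treatment totals of Table 25 and the error sum of squares of Table 26. The Python sketch below is an illustration only; the helper `sign` and the dictionary `results` are assumed names, and the sign rule (+1 where a fertilizer is present, -1 where absent, multiplied over the factors of the comparison) is the one implied by the seven formulae given earlier.

```python
import math

# Treatment totals (sums of four plots each) from Table 25.
totals = {"NPK": 45.2, "N": 44.9, "P": 50.4, "K": 48.2,
          "NP": 41.5, "NK": 38.5, "PK": 49.3, "O": 51.8}

ERROR_MS = 45.5912 / 18        # error mean square, 18 degrees of freedom


def sign(effect, treatment):
    """+1 or -1: product over the factors of the comparison."""
    s = 1
    for factor in effect:
        s *= 1 if factor in treatment else -1
    return s


results = {}
for effect in ["N", "P", "K", "NP", "NK", "PK"]:
    contrast = sum(sign(effect, t) * total for t, total in totals.items())
    ss = contrast ** 2 / 32                 # divisor 32 as in the text
    t_value = math.sqrt(ss / ERROR_MS)      # t for 18 degrees of freedom
    results[effect] = (contrast, ss, t_value)
    print(f"{effect:3s} contrast={contrast:7.1f}  SS={ss:8.4f}  t={t_value:6.3f}")

print("sum of the six items:", round(sum(r[1] for r in results.values()), 4))
```

The negative sign of the nitrogen contrast corresponds to the observation that nitrogen lowered the yield, and the six items together make up the treatment sum of squares.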

In this example the second-order interaction was completely
lost, but a slight modification would have allowed of its recovery
though with reduced precision, and at the expense of decreasing
the precision of other comparisons. Suppose that the eight
blocks are grouped in pairs, each pair consisting of an X block
and a Y block. Then consider what would have been recoverable
if the various X and Y blocks were not alike but had had the
following contents:

      Block X              Block Y

1.    NPK, N, P, K         NP, NK, PK, O

2.    NPK, PK, N, O        NK, NP, K, P

3.    NPK, NK, P, O        PK, NP, K, N

4.    NPK, NP, K, O        PK, NK, P, N

The first group is of the type already examined and has the
second-order interaction NPK confounded. If group 2, however,
is analysed by means of the comparison formulae, it is found to
have the first-order interaction of P and K confounded. Similarly,
the first-order interactions NK and NP are confounded in groups
3 and 4 respectively. So, such an experiment would give
estimates of the three main effects, N, P and K, from all four
groups and estimates of each of the interactions from three of
the four groups. All treatment comparisons are recoverable, but the four
interactions have only 3/4 the precision of the main effects.
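A brief Python sketch, given here as an illustration with assumed names (`groups`, `free_of_blocks`, `recoverable_in`), counts the groups from which each comparison can be recovered in the four-group layout described above. A comparison counts as recoverable in a group when its signs balance within both blocks of the pair.

```python
from itertools import combinations

EFFECTS = ["".join(c) for r in (1, 2, 3) for c in combinations("NPK", r)]


def sign(effect, treatment):
    """+1 if present, -1 if absent, multiplied over the factors of the effect."""
    s = 1
    for factor in effect:
        s *= 1 if factor in treatment else -1
    return s


def free_of_blocks(effect, block_x, block_y):
    """True when the signs of the effect balance within both blocks."""
    return (sum(sign(effect, t) for t in block_x) == 0
            and sum(sign(effect, t) for t in block_y) == 0)


# The four pairs of blocks (X block, Y block) listed in the text.
groups = [
    (["NPK", "N", "P", "K"], ["NP", "NK", "PK", "O"]),
    (["NPK", "PK", "N", "O"], ["NK", "NP", "K", "P"]),
    (["NPK", "NK", "P", "O"], ["PK", "NP", "K", "N"]),
    (["NPK", "NP", "K", "O"], ["PK", "NK", "P", "N"]),
]

recoverable_in = {
    e: sum(free_of_blocks(e, x, y) for x, y in groups) for e in EFFECTS
}
for effect, count in recoverable_in.items():
    print(f"{effect:3s} recoverable from {count} of {len(groups)} groups")
```

Main effects should be reported as recoverable from all four groups and each interaction from three, which is the source of the 3/4 precision.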

Where soil heterogeneity is very marked the use of such
a 'partially confounded' design may, by doubling the number of
blocks, so reduce the error variance that the 3/4 precision of the
partially confounded comparisons may even be greater than the
full precision obtained with half as many blocks each including
the full range of treatment combinations. Partial confounding
is a very valuable device in the application of complex series of
treatments to limited experimental material and has been used
with great success by Yates in designing experiments for a large
number of different purposes.

There are, however, limitations to the use of confounding.
In some cases the confounding of one comparison may
automatically mean the confounding of another, and potentially more
interesting, item. Reference should be made to accounts given
by Fisher and Yates for a more detailed discussion of the scope
and limitations of confounding designs. These designs are worth
attention as they are likely to be of great value in many kinds
of biological experimentation.

REFERENCES 

FISHER, R. A. 1937. The Design of Experiments. Oliver and Boyd.
Edinburgh. 2nd ed.

FISHER, R. A. and YATES, F. 1943. Statistical Tables for Biological, Agricultural
and Medical Research. Oliver and Boyd. Edinburgh. 2nd ed.

Report of the Rothamsted Experimental Station. 1931.

SHIPPY, W. B. 1930. Influence of environment on callusing of apple
cuttings and grafts. Contr. Boyce Thompson Inst., 2, 351-88.

WISHART, J. 1940. Field Trials: Their Lay-out and Statistical Analysis.
Imperial Bureau of Plant Breeding and Genetics.

YATES, F. 1937. The Design and Analysis of Factorial Experiments.
Imperial Bureau of Soil Science.



CHAPTER VIII 


THE INTERRELATIONS OF TWO VARIABLES

31. LINEAR REGRESSION 

IT often happens that in an experiment or series of observations
interest centres on the relations holding between two or more
measurements made on the same individual, family or occasion.
The determination of growth rate, for example, depends on
finding the relation between measurements of time and size;
the study of heredity demands observations on at least two
members of each family, and so on. The statistical reduction
of such data is achieved by the use of regression and correlation
techniques, of which the former class is much the more widely
applicable and so will be considered first.

The law relating the values of two variables to each other
may be written algebraically in the form $y=f(x)$, where $f(x)$
represents a function of $x$ of any degree of complexity. It often
happens, however, that a simple manipulation or transformation
of the data may be used to reduce a complex $f(x)$ to a relatively
simple form, so leading to a considerable reduction in the labour
of calculation. Thus the compound interest law of growth is
often written in the form

$$y = y_0 e^{kx}$$

where $x$ is the time at which the size $y$ is attained, $y_0$ a constant
depending on the size at time 0, and $k$ a constant which is called
the efficiency index. Such a formula would be troublesome to
handle statistically. But if $y$ is transformed into its logarithm
the operations are much simpler, as

$$\log y = \log y_0 + kx$$

This is an expression of the general form $y = a + bx$, which is the
equation of a straight line. When faced with the analysis of
two variables likely to show a complex relation a search should
always be made for a transformation of this kind with a view to
reducing the work involved.
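The effect of the logarithmic transformation can be seen in a small Python sketch. The values of `y0` and `k` and the times used are hypothetical, chosen only for illustration, and natural logarithms are assumed so that the slope of the transformed line is $k$ itself.

```python
import math

# Hypothetical constants, for illustration only.
y0 = 2.0      # size at time 0
k = 0.3       # efficiency index

times = [0, 1, 2, 3, 4]
sizes = [y0 * math.exp(k * x) for x in times]      # y = y0 * e^(kx)

# After transformation, log y = log y0 + k x is a straight line in x.
log_sizes = [math.log(y) for y in sizes]

# Successive slopes of log y against x are all equal to k.
slopes = [
    (log_sizes[i + 1] - log_sizes[i]) / (times[i + 1] - times[i])
    for i in range(len(times) - 1)
]
print("intercept:", log_sizes[0], "slopes:", slopes)
```

The sizes themselves increase by ever larger steps, whereas their logarithms increase by the constant amount $k$ per unit of time, which is what makes the straight-line methods applicable.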

The simplest relation between $x$ and $y$ from the statistical
point of view is that of the straight line $y = a + bx$, and in
consequence this is the most widely used in statistical analysis.
The statistical problem which presents itself is that of estimating
the constants $a$ and $b$ from data provided by a series of double
observations $(x_1, y_1)$, $(x_2, y_2)$, &c. In the case of the growth formula
given above, the data would be the size $y_1$ at time $x_1$, size $y_2$ at
time $x_2$, and so on. We could then by logarithmic transformation
recast the data into a new series of double observations $(\log y_1, x_1)$,
$(\log y_2, x_2)$ ... and so have the material for the estimation of
the two constants, $\log y_0$ and $k$.

If neither $x$ nor $y$ was subject to any uncontrolled, or error,
variation the problem would be easy of solution; for it would
only be necessary to have two double observations to give the
values of the constants,

$$b = \frac{y_1-y_2}{x_1-x_2} \quad\text{and}\quad a = y_1 - x_1\left(\frac{y_1-y_2}{x_1-x_2}\right)$$

Both $x$ and $y$ may, however, be subject to error variation of
various kinds and degrees. The estimation of the two constants
is then less obvious, and the actual method used will be dependent
on the types of error to which each variable is subject. When it
is desired to determine a growth law, for example, the size of
the individual, $y$, will be subject to marked variation as a result
of genetic, dietetic, climatic or other uncontrolled factors; but
the time, $x$, at which the size observation is made will be known
with reasonable exactitude. Now the size, $y$, of a number of
individuals at time $x$ may be considered to be normally distributed
about the mean $\bar{y}$. So long as an unbiased estimate of $\bar{y}$ is
obtainable, as it will be if the population of individuals is sampled
at random in making an observation, the relation of $y$ to $x$ can
be fairly determined. Thus error variation in $y$ does not
invalidate the calculation provided that no selection is exercised.

The situation with regard to $x$ may be of a very different
kind. A serious error in the determination of the time, $x$, in a
growth relation will invalidate the whole calculation, as $x$ is not
normally distributed about a mean $\bar{x}$. It will not be subject to the
usual treatment of error variation. But it should be noted that
any value of $x$ may be chosen for making the corresponding
assessment of size, $y$, without spoiling the data, for there is no
question of taking random samples from a large population of
times. The time $x$ must be determined accurately, but its
values may be selected in such a case.

Where these differences in the type of variation are present,
$x$ is called the 'independent variate' and $y$ the 'dependent
variate'. The distinction is very important, as the determination
of the regression of $y$ on $x$ is a valid statistical operation, while
the calculation of the regression of $x$ on $y$ may be vitiated by the
wrong assumptions made about the nature of the variation to
which the two quantities are subject.

Sometimes, however, $x$ may be subject to error variation
like $y$, in which case selection of the data should not be practised.
When $x$ and $y$ are exactly of the same nature, both being subject
to error variation, either regression may be calculated with equal
validity. Even in this case, however, the two regressions will
not be the same. The differences between the regression of
$y$ on $x$ and that of $x$ on $y$ will be better appreciated after some
concrete cases have been considered.

Where the regression of $y$ on $x$ is linear, i.e. the relation of
$y$ to $x$ is representable geometrically as a straight line, the value
of $y$ corresponding to any value of $x$ can be found, once the
constants $a$ and $b$ are known in the equation $y = a + bx$.* Estimates
of $a$ and $b$ could be obtained in any number of ways from a set
of double observations, though when the calculated line and the
observed points were plotted on a graph some would obviously
not be so good as others. It would not, however, always be clear
from such inspection which of any two formulae gave the better
fit. How, then, should the constants be evaluated?

To answer this question it is necessary to consider the
measurement of discrepancy between observed values of $y$ and those
expected on the basis of any given formula. In the absence of
any information about $x$, the variability of $y$, as we have seen
in an earlier chapter, would be measured by finding the sum of
squares of deviations of $y$ from its mean $\bar{y}$. This would be
equivalent to finding the sum of squares of deviations of $y$ from
the line $y=\bar{y}$ which, as shown in Fig. 5, is parallel to the $x$ axis
of the graph. The sum of squares of departures from any other
such line, parallel to the abscissa, would be larger than that found
when using $\bar{y}$ as it would include an item representing the square
of the difference in position between this new line and the line
$y=\bar{y}$. To use such a second line would be equivalent to calculating
the sum of squares of deviations from a working mean $y_m$. We
have seen in Section 10 that

$$S(y-y_m)^2 = S(y-\bar{y})^2 + n(\bar{y}-y_m)^2$$

The last term must be positive as it is a perfect square, and so

$$S(y-y_m)^2 > S(y-\bar{y})^2$$

Hence, in the absence of information about $x$, the line best
representing the observations is found by minimizing the
magnitude of the sums of squares from the value of $y$ which it represents.
The best value, $\bar{y}$, gives the smallest sum of squares.

As soon as knowledge of $x$ is introduced it is possible to
arrive at expected values of $y$ closer to those observed, for the
straight line need no longer be parallel to the abscissa. When
$y$ alone was considered, the line could vary in position but not
slope. Now it can slope at any angle depending on the constant
$b$, and this slope can bring about a reduction in the sum of
squares of deviations from the expectations derived from the
line. The sum of squares of $y$ is, in fact, being separated into
two parts, one representing departures from the regression line
and the other the difference in slope between the regression line
and that line parallel to the abscissa through $\bar{y}$. Clearly, then,
the best fitting regression line will be the one which maximizes
the item attributable to the slope of the line and hence minimizes
that attributable to the departure of observation from the linear
expectation.

* The determination of the regression of $x$ on $y$ would involve the
calculation of the constants $a'$ and $b'$ in the equation $x = a' + b'y$.

FIG. 5

The relations of the lines $y=y_m$, $y=\bar{y}$ and $y=\bar{y}+b(x-\bar{x})$ to observed points,
to show how the deviations of the latter are progressively reduced. Deviations
from $y=\bar{y}$ are marked by thin uprights and deviations from $y=\bar{y}+b(x-\bar{x})$ by thick
uprights

Let $x$, for convenience' sake, be measured from its mean $\bar{x}$
and let the expected value of $y$ be denoted as $Y$. Then

$$Y = a + b(x-\bar{x})$$

which gives a deviation of

$$(y-Y) = y - a - b(x-\bar{x})$$

when compared with the observed value $y$. The sum of squares
of such deviations is

$$S(y-Y)^2 = S[y-a-b(x-\bar{x})]^2$$
$$= S(y^2)+S(a^2)+S[b(x-\bar{x})]^2-2S(ay)-2S[yb(x-\bar{x})]+2S[ab(x-\bar{x})]$$

which, since $a$ and $b$ are constants and $S(x-\bar{x})=0$, may be written as

$$S(y-Y)^2 = S(y^2)+na^2+b^2S(x-\bar{x})^2-2aS(y)-2bS[y(x-\bar{x})]$$

where $n$ is the number of double observations.



We require the values of $a$ and $b$ which will make this sum
of squares a minimum. These may be found by partially
differentiating with respect to $a$ and $b$, equating the expressions
so obtained to 0 and solving for $a$ and $b$.

$$\frac{\partial}{\partial a}S(y-Y)^2 = 2an - 2S(y) = 0 \qquad \text{(i)}$$

$$\frac{\partial}{\partial b}S(y-Y)^2 = 2bS(x-\bar{x})^2 - 2S[y(x-\bar{x})] = 0 \qquad \text{(ii)}$$

From (i) $$a = \frac{S(y)}{n} = \bar{y}$$

From (ii)* $$b = \frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}$$

This method of finding a and b is referred to as the method 
of least squares and is basic to the theory of regression. It is 
a special case of the method of maximum likelihood discussed in 
Chapter XII. 
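The least squares formulae can be tried out with a short Python sketch. The five double observations below are hypothetical, invented purely for illustration, and the function names `fit_line` and `residual_ss` are assumptions of this sketch. The check at the end confirms that moving either constant away from its least squares value increases the sum of squares of deviations.

```python
def fit_line(xs, ys):
    """Least squares constants for Y = a + b(x - x_bar)."""
    n = len(xs)
    x_bar = sum(xs) / n
    a = sum(ys) / n                                      # a = S(y)/n
    sxx = sum((x - x_bar) ** 2 for x in xs)              # S(x - x_bar)^2
    sxy = sum(y * (x - x_bar) for x, y in zip(xs, ys))   # S[y(x - x_bar)]
    return a, sxy / sxx, x_bar


def residual_ss(xs, ys, a, b, x_bar):
    """Sum of squares of deviations of y from the line Y = a + b(x - x_bar)."""
    return sum((y - a - b * (x - x_bar)) ** 2 for x, y in zip(xs, ys))


# Hypothetical double observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]

a, b, x_bar = fit_line(xs, ys)
best = residual_ss(xs, ys, a, b, x_bar)
print("a =", a, "b =", b, "residual sum of squares =", best)

# Any other choice of a or b gives a larger sum of squares.
for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.05), (0.0, -0.05)]:
    assert residual_ss(xs, ys, a + da, b + db, x_bar) > best
```

Measuring $x$ from its mean is what allows the two constants to be found separately, $a$ from the values of $y$ alone and $b$ from the products with $(x-\bar{x})$.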

The formulae for the evaluation of $a$ and $b$ are not surprising.
The constant $a$ fixes the position of the regression line and $b$,
which is itself called the regression coefficient, fixes its slope.
Now when $x=\bar{x}$, i.e. when $x$ is out of the picture, $Y=a=\bar{y}$, as would
be expected from the consideration of the case when knowledge
of $x$ is lacking. The formula for $b$ bears an obvious resemblance
to that used in finding the standard deviation of a distribution
(Section 10). The other possibility that $b$ is evaluated as the
ratio of the sums of deviations of $y$ and $x$ from their means,
neglecting sign, resembles the calculation of the mean deviation
rather than the standard deviation of a distribution, and the
mean deviation is not so informative a measure as the standard
deviation.


32. THE SAMPLING ERROR OF REGRESSION CONSTANTS 

Having found $a$ and $b$, it is next necessary to determine the
sampling errors to which they are subject. These are dependent
on the residual variation of $y$ about the regression line. We
have found that $a=\frac{S(y)}{n}$ and $b=\frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}$. In each case the greater
the error variance of $y$ the greater the possible random fluctuation
in $a$ and $b$. So the first step is to calculate the sampling variance
of $y$ itself. This is dependent on the sum of squares of deviations
of $y$ from the regression line and can always be found by squaring
and summing these deviations. There is, however, an easier
calculation which will give the same results. Substituting for
$a$ and $b$ in the general regression formula we find

$$S(y-Y)^2 = S\left[y-\bar{y}-(x-\bar{x})\frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}\right]^2$$

$$= S\left[y^2+\bar{y}^2-2y\bar{y}+(x-\bar{x})^2\frac{S^2[y(x-\bar{x})]}{\{S(x-\bar{x})^2\}^2}-2y(x-\bar{x})\frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}+2\bar{y}(x-\bar{x})\frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}\right]$$

$$= S(y^2)+n\bar{y}^2-2\bar{y}S(y)+\frac{S(x-\bar{x})^2\,S^2[y(x-\bar{x})]}{\{S(x-\bar{x})^2\}^2}-\frac{2S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}+\frac{2\bar{y}S(x-\bar{x})S[y(x-\bar{x})]}{S(x-\bar{x})^2}$$

which, since $S(y)=n\bar{y}$ and $S(x-\bar{x})=0$, gives

$$S(y-Y)^2 = S(y^2)-\frac{S^2(y)}{n}-\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}$$

* $S[y(x-\bar{x})]=S[(y-\bar{y})(x-\bar{x})]$ since $S[(y-\bar{y})(x-\bar{x})]=S[y(x-\bar{x})-\bar{y}(x-\bar{x})]
=S[y(x-\bar{x})]-\bar{y}S(x-\bar{x})$ and $S(x-\bar{x})=0$.

The various terms of this expression are easily recognized. The
second is the usual correction for a working mean of 0, i.e. for
not using $\bar{y}$ as the origin in measuring $y$, and so constitutes
a deduction depending on the value of $a$. The third and last term
is $-bS[y(x-\bar{x})]$ and is the correction for the sum of squares removed
by the slope of the regression line. It bears the same relation
to $b$, i.e. to $\frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}$, as the correction for the mean, viz. $\frac{S^2(y)}{n}$,
does to the mean itself. In this way the sum of squares of
deviations of $y$ from its mean is partitioned into parts attributable
to regression and residual error.
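The partition can be verified numerically. The Python sketch below uses hypothetical observations, invented for illustration, and compares the sum of squares found by squaring and summing the deviations from the line with the value given by the shorter formula.

```python
# Hypothetical double observations, for illustration only.
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 7.8, 10.1]
n = len(xs)

x_bar = sum(xs) / n
y_bar = sum(ys) / n
sxx = sum((x - x_bar) ** 2 for x in xs)              # S(x - x_bar)^2
sxy = sum(y * (x - x_bar) for x, y in zip(xs, ys))   # S[y(x - x_bar)]
b = sxy / sxx

# Direct method: square and sum the deviations from the regression line.
direct = sum((y - y_bar - b * (x - x_bar)) ** 2 for x, y in zip(xs, ys))

# Shorter method: S(y^2) - S^2(y)/n - S^2[y(x - x_bar)] / S(x - x_bar)^2.
correction_for_mean = sum(ys) ** 2 / n
regression_ss = sxy ** 2 / sxx                       # equal to b * S[y(x - x_bar)]
shortcut = sum(y * y for y in ys) - correction_for_mean - regression_ss

print(direct, shortcut)
```

The two routes should agree apart from rounding, and the regression item is the same whether written as $S^2[y(x-\bar{x})]/S(x-\bar{x})^2$ or as $bS[y(x-\bar{x})]$.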

As a sum of squares can be attributed to the effect of fitting
a regression coefficient $b$, there must be a corresponding partition
of the degrees of freedom. Where one value of $y$ is observed
there will be one comparison with any theoretically fixed mean;
but if the mean is to be estimated from the data the value itself
will provide the best estimate of this parameter. So there will
be no degree of freedom for the estimation of the variance.
This is, it will be remembered, the reason for the use of $n-1$ as
the number of degrees of freedom in calculating the variance of $y$.
Since $a=\bar{y}$ the calculation of $a$ from the data is accommodated
by the use of $N=n-1$.

When there are two observations, fitting the mean will still
leave one comparison available for the estimate of the variance.
But a straight line can be drawn through any two points on
a graph showing the relation of $y$ to $x$. Hence the calculation of
$b$ removes the single remaining degree of freedom. Then clearly
with $n$ observations, $n-2$ degrees of freedom remain after $a$ and
$b$ have both been fitted. It is another example of the loss of
one degree of freedom when a parameter has been estimated
from the data.

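The analysis of variance of $y$ obtained in this way is

Item                  Sum of Squares                                                                  N

Regression            $\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}$                                     1

Remainder or error    $S(y^2)-\frac{S^2(y)}{n}-\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}$             n-2

Total                 $S(y^2)-\frac{S^2(y)}{n}$                                                       n-1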

The test of significance of $b$ is now obvious. The two mean
squares are found and their ratio is a $t^2$ for $n-2$ degrees of freedom,
which will test whether the calculation of $b$ has removed
a significantly large sum of squares, i.e. whether $b$ differs
significantly from 0. If $b$ does not differ significantly from 0 then,
so far as the observations go, $y$ is unrelated to $x$. If $b$ is significant
it constitutes the best estimate of the change in $y$ for each unit
change in $x$.

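The standard errors of $a$ and $b$ are not difficult to find when
the sampling error of $y$ is known. The variance of $y$, $V_y$, is,
from the analysis given above,

$$V_y = \frac{1}{n-2}\left[S(y^2)-\frac{S^2(y)}{n}-\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}\right]$$

Now $a$ is the mean of $y$ and so from Section 20

$$V_a = \frac{1}{n}V_y = \frac{1}{n(n-2)}\left[S(y^2)-\frac{S^2(y)}{n}-\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}\right]$$

It was also shown in Section 20 that if $y$ is multiplied by any
given quantity, the variance of $y$ is multiplied by the square of
the quantity. So

$$V_{y(x-\bar{x})} = (x-\bar{x})^2V_y$$

and

$$V_{S[y(x-\bar{x})]} = S(x-\bar{x})^2\,V_y$$

But

$$b = \frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2}$$

and so

$$V_b = \frac{S(x-\bar{x})^2}{\{S(x-\bar{x})^2\}^2}V_y = \frac{V_y}{S(x-\bar{x})^2} = \frac{1}{(n-2)S(x-\bar{x})^2}\left[S(y^2)-\frac{S^2(y)}{n}-\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}\right]$$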
Example 12. The application of this analysis to the
interpretation of experimental data may be illustrated by Steward
and Harrison's results on the rate of uptake of rubidium ions
by potato slices. Table 27 gives the number of milligram
equivalents per 1,000 gms. of water in the potato tissue after
immersion in a solution of rubidium bromide for various numbers
of hours. What is the relation of Rb ion content to the length
of time of immersion?


TABLE 27

The Uptake of Rb and Br ions by Potato Slices (Steward and Harrison)

Time of immersion       mg. equivalents per 1,000 gms. of
in hours (x)            water in the tissue
                        Rb ($y_R$)        Br ($y_B$)
21.7                    7.2               0.7
46.0                    11.4              6.4
67.0                    14.2              9.9
90.2                    19.1              12.8
95.5                    20.0              15.8

Total 320.4             71.9              45.6

$S(x)=320.4$                  $S(y_R)=71.9$                       $S(y_B)=45.6$
$\bar{x}=64.08$               $\bar{y}_R=14.38$                   $\bar{y}_B=9.12$
$S(x-\bar{x})^2=3,800.948$    $S(y_R-\bar{y}_R)^2=114.328$        $S(y_B-\bar{y}_B)^2=137.068$
                              $S[y_R(x-\bar{x})]=657.508$         $S[y_B(x-\bar{x})]=714.302$
                              $b_R=0.172985$                      $b_B=0.187927$
                              Error sum of squares $=0.5888$      Error sum of squares $=2.8311$
                              $V_{y_R}=0.1963$                    $V_{y_B}=0.9437$
                              $V_{b_R}=0.000051645$               $V_{b_B}=0.000248280$
                              $s_{b_R}=0.007186$                  $s_{b_B}=0.015757$

$d=b_B-b_R=0.014942$
$s_d=0.017315$

Clearly time must be the independent variate, $x$, and Rb
content the dependent variate, $y$, as the latter is subject to normal
error variation while the former is not. Furthermore, the times
have been selected as convenient to the experimenters, while there
is no reason to suspect any selection in the choice of potato
for analysis at any given time. So the regression to be determined
is that of Rb content on time.







From Table 27 $n=5$, $S(x)=320.4$, and $S(y)=71.9$.

$$S(x-\bar{x})^2 = S(x^2)-\frac{S^2(x)}{n} = 24,332.180-20,531.232 = 3,800.948$$

$$S(y-\bar{y})^2 = S(y^2)-\frac{S^2(y)}{n} = 1,148.250-1,033.922 = 114.328$$

The calculation of $S[y(x-\bar{x})]$, or the identical $S[(y-\bar{y})(x-\bar{x})]$,
requires a word of explanation. It could, of course, be found
by multiplying together each $y$ value and the corresponding
$(x-\bar{x})$ and subsequently summing. But it is more conveniently
found in a way analogous to that used in computing the sum of
squares of $x$ or $y$. For

$$S[y(x-\bar{x})] = S(xy)-S(y\bar{x}) = S(xy)-\bar{x}S(y) = S(xy)-\frac{S(x)S(y)}{n}$$

Then in the present case

$$S[y(x-\bar{x})] = 5,264.860-\frac{320.4\times 71.9}{5} = 657.508$$

So

$$a = \frac{S(y)}{n} = 14.38 \quad\text{and}\quad b = \frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2} = \frac{657.508}{3,800.948} = 0.172985$$

The linear regression of $y$ on $x$ is thus given by the formula

$$Y = 14.38+0.172985(x-\bar{x})$$

The content of Rb in the potato increases by 0.172985 mg.
equivalents per 1,000 gms. of water each hour.
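The working of this example can be retraced in Python with the Rb figures of Table 27. The sketch below is only a check on the arithmetic; the variable names are illustrative and the shortcut formulae are those given in the text.

```python
# Steward and Harrison's data as read from Table 27.
x = [21.7, 46.0, 67.0, 90.2, 95.5]     # hours of immersion
y = [7.2, 11.4, 14.2, 19.1, 20.0]      # Rb, mg. equivalents per 1,000 gms. of water
n = len(x)

# Sums of squares and products about the means, by the shortcut formulae.
sxx = sum(v * v for v in x) - sum(x) ** 2 / n                    # S(x - x_bar)^2
syy = sum(v * v for v in y) - sum(y) ** 2 / n                    # S(y - y_bar)^2
sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n     # S[y(x - x_bar)]

a = sum(y) / n          # position of the line at x = x_bar
b = sxy / sxx           # regression coefficient
x_bar = sum(x) / n

print(f"S(x-x_bar)^2 = {sxx:.3f}, S(y-y_bar)^2 = {syy:.3f}, S[y(x-x_bar)] = {sxy:.3f}")
print(f"Y = {a:.2f} + {b:.6f}(x - {x_bar:.2f})")
```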

To find the standard deviation of $b$ the variance of $y$ must
first be analysed. The total sum of squares has already been
found to be 114.328, for, of course, 4 degrees of freedom. The
regression of $y$ on $x$ takes from this total a sum of squares of
$\frac{S^2[y(x-\bar{x})]}{S(x-\bar{x})^2}$ for 1 degree of freedom. $S[y(x-\bar{x})]=657.508$ and
$S(x-\bar{x})^2=3,800.948$, and so the sum of squares for regression is
$\frac{657.508^2}{3,800.948}$ or 113.7392. The full analysis becomes

Item          Sum of Squares    N    Mean Square
Regression    113.7392          1    113.7392
Error         0.5888            3    0.1963
Total         114.3280          4

The significance of the regression coefficient can then be
tested by the calculation of $t_{[3]}=\sqrt{\frac{113.7392}{0.1963}}=24.071$, which has
a very small probability. The coefficient is very significant.
A $t$ test is used as the error variance is estimated on the basis of
three comparisons from the data.

The variance of $b$ is given by the formula $V_b=\frac{1}{S(x-\bar{x})^2}V_y$, and
so arithmetically is $\frac{0.1963}{3,800.948}=0.000051645$ and $s_b=\sqrt{V_b}=0.007186$.
If the departure of $b$ from 0 is compared with this estimated
standard deviation, $t_{[3]}=\frac{0.172985}{0.007186}=24.071$, as already found by
the analysis of variance of $y$. This is not surprising because the
two tests differ only in the method of arriving at $t$, just as when
the difference of two means was tested in Sections 21 and 24.

The standard deviation of $a$ is found as $\sqrt{\frac{0.1963}{5}}=0.1981$.
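The analysis of variance and the two routes to $t$ can be checked with the following Python sketch, again using the Rb figures of Table 27. Names such as `regression_ss` and `v_y` are illustrative. Because the sketch carries more decimal places than the text, the last figure of $t$ may differ slightly from 24.071.

```python
import math

x = [21.7, 46.0, 67.0, 90.2, 95.5]     # hours of immersion (Table 27)
y = [7.2, 11.4, 14.2, 19.1, 20.0]      # Rb content (Table 27)
n = len(x)

sxx = sum(v * v for v in x) - sum(x) ** 2 / n
syy = sum(v * v for v in y) - sum(y) ** 2 / n
sxy = sum(u * v for u, v in zip(x, y)) - sum(x) * sum(y) / n
b = sxy / sxx

regression_ss = sxy ** 2 / sxx          # 1 degree of freedom
error_ss = syy - regression_ss          # n - 2 = 3 degrees of freedom
v_y = error_ss / (n - 2)                # error variance of y

# Route 1: square root of the ratio of the two mean squares.
t_from_anova = math.sqrt(regression_ss / v_y)

# Route 2: b compared with its own standard deviation.
v_b = v_y / sxx
s_b = math.sqrt(v_b)
t_from_b = b / s_b

s_a = math.sqrt(v_y / n)                # standard deviation of a

print(f"regression SS = {regression_ss:.4f}, error SS = {error_ss:.4f}, Vy = {v_y:.4f}")
print(f"t (analysis of variance) = {t_from_anova:.3f}, t (b / s_b) = {t_from_b:.3f}")
print(f"s_b = {s_b:.6f}, s_a = {s_a:.4f}")
```

The two values of $t$ are algebraically identical, since $b^2/V_b = S^2[y(x-\bar{x})]/\{S(x-\bar{x})^2\,V_y\}$.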


It is sometimes of interest to know whether the regression
line passes through some point with which it might theoretically
be expected to agree, within the limits of sampling error. This
can be tested quite easily. When $x=x_1$ and $Y_1=a+b(x_1-\bar{x})$,
$V_{Y_1}=V_a+(x_1-\bar{x})^2V_b$. The derivation of this formula follows from
the principles developed in Section 20. The statistics $a$ and $b$ are
orthogonal. Hence the variance of a compound of $a$ and $b$ is
the sum of the variances of the two parts which are dependent
on $a$ and $b$ respectively. In the calculation of $Y_1$, $a$ takes the
coefficient 1 and so contributes $1^2\times V_a$ to the variance of $Y_1$, but
$b$ has a coefficient of $(x_1-\bar{x})$ and so $V_b$ is multiplied by $(x_1-\bar{x})^2$.
Hence $V_{Y_1}$ is the sum of $V_a$ and $(x_1-\bar{x})^2V_b$. The deviation, $d$, of
$Y_1$ from the value with which it is to be compared has a
standard error of $\sqrt{V_a+(x_1-\bar{x})^2V_b}$, and its significance may be
tested by the calculation of $t=\frac{d}{s_d}=\frac{d}{\sqrt{V_a+(x_1-\bar{x})^2V_b}}$ for a number
of degrees of freedom equal to that for the estimation of $V_y$.
This allocation of the degrees of freedom may not appear obvious
at first sight, but it follows from the fact that

$$V_{Y_1} = V_y\left[\frac{1}{n}+\frac{(x_1-\bar{x})^2}{S(x-\bar{x})^2}\right]$$

It is now clear that $V_{Y_1}$ is derived from $V_y$ by the use of a
multiplier itself independent of $y$. Hence $V_{Y_1}$ has the same
number of degrees of freedom as $V_y$.

In the case of Steward and Harrison's data it is of some
interest to know whether the regression line may be considered
to pass through the origin, i.e. the point (0, 0). Substituting
0 for $x_1$, 64.08 for $\bar{x}$, and 14.38 for $a$, we find

$$Y_1 = 14.38+0.172985(0-64.08) = 14.38-11.0849 = 3.2951$$

Since the value expected is 0 the deviation, $d$, is 3.2951.

$$V_d = V_{Y_1} = V_a+(-64.08)^2V_b = 0.03926+(-64.08)^2\times 0.000051645 = 0.251327$$

and $s_d=\sqrt{V_d}=0.50133$

Then $t_{[3]}=\frac{3.2951}{0.50133}=6.573$, which has a probability of less than 0.01.

The line cannot reasonably be supposed to pass through the
origin. Since, however, at time 0 the Rb content of the potatoes
was presumably 0 also, this result must mean that the regression
was not linear for the whole time preceding the first measurement
at 21.7 hours. The absorption of Rb ions must have been more
rapid in the early stages than it was later in the experiment.

33. THE DIFFERENCE BETWEEN TWO REGRESSION COEFFICIENTS 

It is often necessary to test the agreement of two regression
coefficients, and this may be done in either of two ways, which,
however, lead ultimately to the same test. In the first place the
difference of the two coefficients may be compared with its
standard error by means of a $t$ test. The standard error is found,
as might be expected, by taking the square root of the sum of
the variances of the two coefficients, provided, of course, that
they were, as is usually the case, independently determined.
Thus if

$$d = b_1-b_2$$

$$V_d = V_{b_1}+V_{b_2} \quad\text{and}\quad s_d = \sqrt{V_{b_1}+V_{b_2}}$$

Then

$$t = \frac{d}{s_d} = \frac{b_1-b_2}{\sqrt{V_{b_1}+V_{b_2}}}$$

and will have a number of degrees of freedom equal to the sum
of the numbers available for the estimation of $V_{b_1}$ and $V_{b_2}$.

The second way of testing the agreement of $b_1$ and $b_2$ is by
the use of the analysis of variance. Each of the two regression
coefficients accounts for a characteristic sum of squares in the
analyses of variance of the separate sets of y data. Now if,
apart from sampling error, $b_1$ is really equal to $b_2$ we can put
$b_1 = b_2 = b$ and calculate the value of b from the pooled sums of
squares of x and sums of cross-products of x and y as derived
from the two sets of data. A regression coefficient found in this
way will account for a characteristic sum of squares in the joint
analysis of variance. This sum of squares for b must be less
than the pooled sum of squares for $b_1$ and $b_2$, and will have
1 degree of freedom, whereas the latter pool will have 2 degrees
of freedom. The difference between the pooled sum of squares
for $b_1$ and $b_2$ and the sum of squares for b corresponds to the
remaining degree of freedom and may be used as a test of
significance of $b_1 - b_2$, since it is the sum of squares remaining when
the best joint regression coefficient has been found. This
approach can be extended to test the homogeneity of three or
more regression coefficients and is of wider application than the
simple use of a t test.

Example 13. Steward and Harrison immersed their potato 
discs in a solution of rubidium bromide and estimated the Br 
content at the same times as the Rb content. The Br figures 
are given in Table 27 together with the Rb data. Is the rate of 
uptake the same for both kinds of ion ? 

The regression line of the Rb content on time has already
been found as:

$$Y_R = 14.38 + 0.172985(x - \bar{x})$$

The regression of Br content on time can be found in exactly
the same way. The same observation times were used and so
$\bar{x} = 64.08$ and $S(x-\bar{x})^2 = 3,800.948$ as before. But

$$\bar{y}_B = \frac{45.6}{5} = 9.12$$

and

$$S(y_B - \bar{y}_B)^2 = 552.940 - \frac{45.6^2}{5} = 137.068$$

$$S[y_B(x - \bar{x})] = 3,636.350 - \frac{320.4 \times 45.6}{5} = 714.302$$

Then

$$b_B = \frac{714.302}{3,800.948} = 0.187927$$

This regression accounts for a sum of squares of
$\dfrac{714.302^2}{3,800.948}$,
i.e. 134.2369 out of the total 137.068, leaving 2.8311 as the
portion corresponding to the three error degrees of freedom. Then

$$V_{y_B} = \frac{2.8311}{3} = 0.9437$$

and

$$V_{b_B} = \frac{0.9437}{3,800.948} = 0.000248280$$

Thus

$$d = b_B - b_R = 0.187927 - 0.172985 = 0.014942$$

and

$$s_d = \sqrt{V_{b_R} + V_{b_B}} = \sqrt{0.000051645 + 0.000248280} = 0.017318$$

$$t_{[6]} = \frac{0.014942}{0.017318} = 0.863$$

This t has (3+3) degrees of freedom, as $V_{b_R}$ and $V_{b_B}$ were each
estimated from 3 comparisons. The difference, d, is less than
its standard error and hence the two rates of uptake are in good
agreement.
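The t test of the difference between the two coefficients can be retraced from the sums of squares and cross-products quoted in the text. The following Python sketch assumes only those quoted figures; the dictionary layout and names are illustrative.

```python
import math

Sxx = 3800.948                               # S(x - x_bar)^2, common to both series
Sxy = {"Rb": 657.508, "Br": 714.302}         # sums of cross-products with time
error_ss = {"Rb": 0.5888, "Br": 2.8311}      # sums of squares left after regression
error_df = {"Rb": 3, "Br": 3}

# Regression coefficient and its variance for each ion.
b = {ion: Sxy[ion] / Sxx for ion in Sxy}
V_b = {ion: (error_ss[ion] / error_df[ion]) / Sxx for ion in Sxy}

d = b["Br"] - b["Rb"]                        # difference of the coefficients
s_d = math.sqrt(V_b["Rb"] + V_b["Br"])       # standard error of the difference
t = d / s_d
df = error_df["Rb"] + error_df["Br"]

print(round(d, 6), round(s_d, 6), round(t, 3), df)
```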

To test the significance of $b_B - b_R$ by means of the analysis of
variance it is necessary to calculate a joint regression coefficient.
This can be done from the material already available. The joint
sum of cross-products, like the joint sum of squares of x, is equal
to the sum of the two separate items. So

$$S[y(x-\bar{x})] = 657.508 + 714.302 = 1,371.810$$

and

$$S(x-\bar{x})^2 = 3,800.948 + 3,800.948 = 7,601.896$$

Then $b = \dfrac{1,371.810}{7,601.896} = 0.180456$. The sum of squares accounted for
by this joint regression is $\dfrac{1,371.810^2}{7,601.896} = 247.5518$.

The two separate regressions accounted for 113.7392 and
134.2369 of the sum of squares of $y_R$ and $y_B$ respectively, leaving
0.5888 and 2.8311 for the two error items. Taking $y_R$ and $y_B$
together, this means that 247.9761 was taken out by the two
regressions for 2 degrees of freedom in all, while there remained
3.4199 corresponding to the 6 error degrees of freedom. Of the
247.9761 for the two regressions, we have found that 247.5518
is attributable to the best fitting regression coefficient, leaving
0.4243 as the sum of squares for the difference between $b_R$ and $b_B$.
These results are set out in the form of an analysis of variance
in Table 28 which also gives the mean squares. Two t's may be
calculated, one to test the significance of the joint regression and
the other to test the difference between $b_R$ and $b_B$. The result
of the former test is not in question as it will obviously be very
large. The second test, that of the difference between $b_R$ and
$b_B$, gives $t_{[6]} = \sqrt{\dfrac{0.4243}{0.5700}} = 0.863$ with a probability, as before, of
between 0.5 and 0.4. There is no evidence that the uptakes of
Rb and Br ions proceed at different rates.
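The analysis of variance route can be sketched in the same way. The Python fragment below uses the sums quoted in the text (names are illustrative) and shows that the t obtained from the sum of squares for the difference between regressions is that already given by the direct comparison of the two coefficients.

```python
import math

Sxx = 3800.948                          # S(x - x_bar)^2 in each series
Sxy_R, Sxy_B = 657.508, 714.302         # cross-products for Rb and Br
error_R, error_B = 0.5888, 2.8311       # error sums of squares, 3 df each

# Sums of squares taken out by the separate and by the joint regressions.
ss_separate = Sxy_R ** 2 / Sxx + Sxy_B ** 2 / Sxx     # 2 degrees of freedom
ss_joint = (Sxy_R + Sxy_B) ** 2 / (Sxx + Sxx)         # 1 degree of freedom
ss_difference = ss_separate - ss_joint                # remaining degree of freedom

error_ms = (error_R + error_B) / 6
t_anova = math.sqrt(ss_difference / error_ms)

# Direct t test of b_B - b_R for comparison.
d = (Sxy_B - Sxy_R) / Sxx
s_d = math.sqrt((error_R / 3 + error_B / 3) / Sxx)
t_direct = d / s_d

print(round(ss_joint, 4), round(ss_difference, 4), round(t_anova, 3), round(t_direct, 3))
```

The agreement of the two t's is exact here because the same observation times, and hence the same sum of squares of x, were used in both series.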

TABLE 28

Analysis of Variance of Uptake of Rb and Br Ions by Potato Slices

Item                              Sum of Squares   N   Mean Square      t       Probability
Joint regression                     247.5518      1    247.5518
Difference between regressions         0.4243      1      0.4243      0.863     0.5-0.4
Difference between means              69.1690      1     69.1690     11.014     very small
Error                                  3.4199      6      0.5700

Total                                320.5650      9


One further item is included in the analysis of variance in
Table 28. Those already considered account for 8 degrees of
freedom but, when the two series are considered together, there
are ten observations with 9 degrees of freedom between them.
The ninth degree of freedom is that for differences between the
means of $y_R$ and $y_B$. This clearly has been omitted from the
analysis so far since all the sums of squares have been taken
about the separate means of $y_R$ and $y_B$ in the two series. The
last item can be found as $\tfrac{1}{10}[S(y_R) - S(y_B)]^2$, the divisor being
10 because each observation is used once in obtaining the number
to be squared. From Table 27 the calculation becomes

$$\tfrac{1}{10}[71.9 - 45.6]^2 = 69.1690$$

As this sum of squares corresponds to one degree of freedom,
the mean square is also 69.1690 and comparison with error gives
$t_{[6]} = \sqrt{\dfrac{69.1690}{0.5700}} = 11.014$. The two means definitely differ.


The two series of observations, the one on the Rb content
and the other on the Br content, were made at exactly the same
times and, as $b_R = b_B$, the difference between their means indicates
that the regression lines differ in position. They are plotted,
together with the observed points, in Fig. 6, from which it will
be seen that whereas the rate of uptake of Rb must have been
higher before 21.7 hours than it was afterwards, the rate of
uptake of Br must have been lower before this time than it was
later. The two kinds of ion show the same behaviour after the
initial observation but differed in the early stages of the
experiment, Rb starting rapidly and Br slowly.

FIG. 6

Steward and Harrison’s observations, on the rate of uptake of Rb and
Br ions by potato slices, in relation to the calculated regression lines. The two
regression lines, though parallel over the period of observation, pass on either
side of the origin when projected, so showing that the rates of uptake must
have differed prior to the first observation

34. THE USE OF CONCOMITANT OBSERVATIONS 

Before leaving the subject of linear regression for the
consideration of more complex situations, one further and very
important use must be mentioned, viz. the incorporation of
concomitant measurements into a statistical analysis. The value
of this technique is most easily shown by consideration of a
concrete example.

Example 14. There is reason to suspect that trout fry raised 
in swiftly running streams respire more actively than others whose 
life has been spent in slowly moving water. Washbourn conducted 
an experiment to test this expectation. His results were expressed 
in cubic millimetres of oxygen consumed per gram of fish per 
hour, but for our purpose it is necessary to reconstruct the actual 
amounts of oxygen used per hour by the fish in the various 
experiments. These data are given in Table 29, together with 
the wet weights of the fish themselves. 

TABLE 29

Oxygen Consumption in Cubic Millimetres per Hour of Trout Fry (Washbourn)

Series          Wet weight      Total oxygen        Oxygen consumption
                of fish (x)     consumption (y)     per gm. wet weight
                                (reconstructed)

Swift water         7.1              766.8                108
                    7.0              854.0                122
                    7.5            1,080.0                144
                    7.4              954.6                129
                    7.5              802.5                107
                    7.5              862.5                115
                    4.4              501.6                114
                    4.3              417.1                 97
                    6.2              595.2                 96
                    8.2            1,033.2                126

Total              67.1            7,867.5              1,158

Slow water          7.2              612.0                 85
                    6.2              942.4                152
                    4.4              365.2                 83
                    4.0              276.0                 69
                    7.7              731.5                 95
                    5.6              487.2                 87
                    5.2              369.2                 71
                    5.3              498.2                 94
                    5.6              464.8                 83
                    7.1              667.4                 94

Total              58.3            5,413.9                913

Grand total       125.4           13,281.4              2,071

The experiment was quite a simple one. Newly hatched
trout fry were separated into two groups, one being raised in
swiftly running and the other in slowly moving water. The
source of the water and its temperature were controlled so that
each sample of fish was subjected as nearly as possible to the
same conditions, apart from the speed of flow. The rate of
respiration of the fish, when of a convenient size, was measured
by finding their oxygen consumption. Parallel tests of the two
groups were conducted, ten assays from each set being made in
all. The total weight of the ten fish used in each assay was also
recorded. These weights were somewhat variable, the individuals
used from the swiftly running water having a higher mean than
those from the other group. Though interest centres on the
oxygen consumption, the concomitant measurement of fish
weight is clearly of importance in interpreting the results, as
any difference in respiration rate between fish from two different
habitats could perhaps be ascribed to weight discrepancies. How
should the weights be used in the analysis of respiration rate?

Two courses are open. Either some arbitrary correction of
the oxygen consumption can be made to allow for variation in
weight, or the data can be used to supply their own correction.
If the former method is adopted the correction may be based
on previous experience or on a priori reasoning. In either case,
objections may be raised as the experiment under consideration
may differ from others in which a correction could be calculated,
and a priori arguments are very dangerous guides in such cases.
No exception can, on the other hand, be taken to a correction
based on the data of the experiment itself and such a correction
is not difficult to calculate.

Ten tests were made on individuals from each habitat, the
weights of the fish varying from test to test. The ten tests thus
supply all the information necessary for determining the
regression of oxygen consumption on fish weight. The problem of
testing the effect of water speed then resolves itself into testing
whether the two regressions of respiration on weight, one from
the slow and the other from the fast water, are in fact different.
This test would offer no difficulties if treated by the method
of the previous section, but the analysis would be
laborious, because, by the very nature of the experiment, the
same fish weights do not occur in the two series.
An alternative method, known as the analysis of covariance,
may be adopted if it is desired only to perform a test of
significance, without calculating the actual regression lines.

The null hypothesis to be tested is that the rate of
respiration of equally heavy fish is the same whether they are raised
in slow or fast water. We thus assume that the regression
of respiration rate on weight is the same in fish from both
habitats. When this is not true, the hypothesis will be as much
disproved as if the corrected means were different. So the
regression coefficient can be estimated from the two series jointly,
and the method is exactly the same as that used for the uptake
of Rb and Br ions in the last example. The sums of squares
of x and cross-products of x and y are found from each series
separately, using the series, and not the general, means, and the
data are pooled before calculating b. In such a calculation both
the sum of squares of x and the sum of cross-products of x and
y are in reality being split up into two parts, the sum within
series and the sum between series, though this latter does not
appear explicitly in the analysis. The process may, then, be
cast in the form of an analysis of variance of x and of the
covariance of x with y.

For all of the 20 observations S(x) = 125.4, where x is the wet
weight of the fish, and $S(x-\bar{x})^2 = 819.64 - \dfrac{125.4^2}{20} = 33.382$. The two
series totals, 67.1 and 58.3, are each based on 10 observations,
and so the sum of squares of x between series is

$$\tfrac{1}{10}(67.1^2 + 58.3^2) - \frac{125.4^2}{20} = 3.872$$

which on being subtracted from the total sum of squares already
found leaves 29.510 for the item within series. Of the total
19 degrees of freedom, 1 is accounted for by the difference between
series, leaving 18 for differences within series.

The analysis of the cross-products is made in the same way.
S(x) = 125.4 and S(y) = 13,281.4. Then

$$S[y(x-\bar{x})] = 88,459.620 - \frac{125.4 \times 13,281.4}{20} = 5,185.242$$

The cross-product between series is

$$\tfrac{1}{10}(67.1 \times 7,867.5 + 58.3 \times 5,413.9) - \frac{125.4 \times 13,281.4}{20} = 1,079.584$$

leaving 4,105.658 as the sum of cross-products within series.
The degrees of freedom are partitioned as in the case of the
analysis of x.

Though not wanted immediately, it is convenient to analyse
the variance of y, the oxygen consumption, at this stage. The
total sum of squares is 1,101,071.422 and the sum of squares
between series 301,007.648, leaving 800,063.774 within series.
The three analyses, of $x^2$, xy and $y^2$, are set out in Table 30.

TABLE 30

Analysis of Covariance of Oxygen Consumption of Trout Fry

                                                                      Correction for
Item              N      x^2          xy              y^2             regression of y on x
Between series     1     3.872     1,079.584       301,007.648
Within series     18    29.510     4,105.658       800,063.774         571,210.695

Total             19    33.382     5,185.242     1,101,071.422         805,426.116

Analysis of Variance of y after Correction

Item              Sum of Squares     N     Mean Square       t       Probability
Between series       66,792.227      1      66,792.227     2.227     0.05-0.02
Within series       228,853.079     17      13,461.946

Total               295,645.306     18

The regression coefficient of respiration on weight would
normally be found from the line of Table 30 giving the items
within series. The cross-products item is 4,105.658 and the sum
of squares of x, 29.510. So $b = \dfrac{4,105.658}{29.510} = 139.1277$. The sum of
squares for y for which this regression accounts is $\dfrac{4,105.658^2}{29.510} =
571,210.695$, leaving 228,853.079 for the 17 error degrees of
freedom of y.

The regression coefficient calculated in this way can be used
to make a direct correction for weight differences between series.
When a test of significance is wanted, however, this is not the
best procedure as it leads to a slightly biased result through
failing to take account of the effect of sampling error in the
estimation of b. The exact test is given by a slightly different
procedure. A correction for regression is calculated from the
total sums of squares and cross-products, in exactly the same
way as that already found from the items within series. This
correction is clearly $\dfrac{5,185.242^2}{33.382} = 805,426.116$, which on subtraction
leaves 295,645.306 as the corrected total sum of squares of y.

Each of the two regression corrections, obtained from the
total sums and from the sums within series respectively, accounts
for 1 degree of freedom, leaving 18 for the corrected ‘total’
and 17 for the corrected ‘within series’ sums of squares of y.
If, when due allowance has been made for weight differences,
there is no difference in the oxygen consumption of fish from
the two habitats, the two corrections are equivalent within the
limits of sampling error. Then the subtraction of the corrected
sum of squares of y within series from the corrected total sum
of squares of y will leave a corrected sum of squares of y between
series, so giving an analysis of variance of corrected y values.
This is shown in the lower part of Table 30. The item between
series in this corrected analysis will have 1 degree of freedom
as before, since the corrected total has 18, and the corrected item
within series has 17, degrees of freedom. The mean squares are
found by division of the sums of squares by N and a t is used
to compare the item between series with that within series.

$$t_{[17]} = \sqrt{\frac{66,792.227}{13,461.946}} = 2.227$$

which has a probability of between 0.05 and 0.02. The difference
between the two series is suggestively large.
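The whole analysis of covariance can be retraced from the weights and total oxygen consumptions of Table 29. The Python sketch below is offered only as a check on the arithmetic; the function names and data layout are illustrative and are not part of the original text.

```python
import math

# (wet weight x, total oxygen consumption y) for each assay in Table 29.
swift = [(7.1, 766.8), (7.0, 854.0), (7.5, 1080.0), (7.4, 954.6), (7.5, 802.5),
         (7.5, 862.5), (4.4, 501.6), (4.3, 417.1), (6.2, 595.2), (8.2, 1033.2)]
slow = [(7.2, 612.0), (6.2, 942.4), (4.4, 365.2), (4.0, 276.0), (7.7, 731.5),
        (5.6, 487.2), (5.2, 369.2), (5.3, 498.2), (5.6, 464.8), (7.1, 667.4)]

def corrected_sums(pairs):
    """Sums of squares and cross-products about the means: (Sxx, Sxy, Syy)."""
    n = len(pairs)
    sx = sum(x for x, _ in pairs)
    sy = sum(y for _, y in pairs)
    sxx = sum(x * x for x, _ in pairs) - sx * sx / n
    sxy = sum(x * y for x, y in pairs) - sx * sy / n
    syy = sum(y * y for _, y in pairs) - sy * sy / n
    return sxx, sxy, syy

def after_regression(sxx, sxy, syy):
    """Sum of squares of y remaining when the regression item is deducted."""
    return syy - sxy ** 2 / sxx

total = corrected_sums(swift + slow)                  # about the general means
within = tuple(p + q for p, q in
               zip(corrected_sums(swift), corrected_sums(slow)))  # about series means

b_within = within[1] / within[0]                      # joint regression coefficient
ss_total = after_regression(*total)                   # 18 degrees of freedom
ss_within = after_regression(*within)                 # 17 degrees of freedom
ss_between = ss_total - ss_within                     # 1 degree of freedom

t = math.sqrt(ss_between / (ss_within / 17))
print(round(b_within, 4), round(ss_between, 3), round(t, 3))
```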

Before leaving this example it will be of interest to compare
the results obtained by the analysis of covariance with those
which would have been found if no correction was made for the
effect of weight and also with those which would have followed
the use of an obvious arbitrary correction.

The analysis of variance of uncorrected y is to be found in
Table 30 and this is, of course, the analysis which would be
used to test the significance of water speed if no allowance were
to be made for weight difference. The mean square between
series is 301,008 and that within series 44,448. The test of
significance is thus $t_{[18]} = \sqrt{\dfrac{301,008}{44,448}} = 2.602$, which has a probability
of less than 0.02. The significance of the difference shown by
this analysis is greater than that obtained when corrections are
made. The neglect of the weight variation has led to a
misleading exaggeration of the significance. In other experiments
the result could of course go the other way, viz. the significance
could be underestimated by neglect of the concomitant
measurement. In either case the test of significance would be misleading.

The most obvious arbitrary correction for weight difference
is that of using the oxygen consumption per unit weight of fish,
rather than the oxygen consumption itself, as the variate. There
are some theoretical arguments against the use of such a
correction, but none is so strong as the objection which becomes
apparent when the result of this analysis is compared with that
of the analysis of covariance. The oxygen consumption per unit
weight is given in Table 29 for each of the 20 tests. The analysis
of variance of these data follows exactly the same lines as that
for total oxygen consumption. The mean square between series
is found as 3,001.250, while that within series is 378.761. Then
$t_{[18]} = \sqrt{\dfrac{3,001.250}{378.761}} = 2.815$ with a probability of only just over 0.01.

The arbitrary correction has not, in fact, made a proper
allowance for the weight difference. On the contrary, it has magnified
the already misleadingly high significance obtained when the
uncorrected total oxygen consumptions are analysed. The
dangers of arbitrary correction are clearly very large. Where
concomitant observations are involved the analysis of covariance
should be used, and the experiment allowed to supply its own
correction formula. When designing an experiment in which
concomitant observations will be made, care should be taken
that the results will be susceptible to analysis by the covariance
method.
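The two comparisons just made, that on the uncorrected oxygen consumption and that on the consumption per unit weight, can be reproduced from Table 29. The following Python sketch is illustrative only; the helper `two_group_t` is a name of our own, and the consumption per gram is taken to the nearest unit as in the table.

```python
import math

# (wet weight x, total oxygen consumption y) for each assay in Table 29.
swift = [(7.1, 766.8), (7.0, 854.0), (7.5, 1080.0), (7.4, 954.6), (7.5, 802.5),
         (7.5, 862.5), (4.4, 501.6), (4.3, 417.1), (6.2, 595.2), (8.2, 1033.2)]
slow = [(7.2, 612.0), (6.2, 942.4), (4.4, 365.2), (4.0, 276.0), (7.7, 731.5),
        (5.6, 487.2), (5.2, 369.2), (5.3, 498.2), (5.6, 464.8), (7.1, 667.4)]

def two_group_t(group_a, group_b):
    """t for the difference between two series, found from the analysis of
    variance as the square root of (mean square between / mean square within)."""
    n_a, n_b = len(group_a), len(group_b)
    total_a, total_b = sum(group_a), sum(group_b)
    correction = (total_a + total_b) ** 2 / (n_a + n_b)
    ss_total = sum(v * v for v in group_a + group_b) - correction
    ss_between = total_a ** 2 / n_a + total_b ** 2 / n_b - correction
    ms_within = (ss_total - ss_between) / (n_a + n_b - 2)
    return math.sqrt(ss_between / ms_within)

# No allowance for weight: total oxygen consumption as the variate.
t_raw = two_group_t([y for _, y in swift], [y for _, y in slow])

# Arbitrary correction: oxygen consumption per gram, to the nearest unit.
t_per_gram = two_group_t([round(y / x) for x, y in swift],
                         [round(y / x) for x, y in slow])

print(round(t_raw, 3), round(t_per_gram, 3))
```

Both values exceed the 2.227 given by the analysis of covariance, which is the point of the comparison made in the text.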

This example is a very simple one and the use of the
correction did not make a very large difference, though, since the
probability was in the 0.05-0.02 region, the effect was important.
In many cases, however, the results are completely obscure until
an analysis of covariance has been made. The correction need
not just be that appropriate to linear regression. Polynomial
and multiple regression corrections can be used, provided that
sufficient degrees of freedom are available to supply an estimate
of error after their deduction.

REFERENCES 

STEWARD, F. C., and HARRISON, J. A. 1939. The absorption and
accumulation of salts by living plant cells. IX. Ann. Bot. N.S., 3,
427-54.

WASHBOURN, B. 1936. Metabolic rates of trout fry from swift and slow
running waters. J. exp. Biol., 13, 145-7.



CHAPTER IX


POLYNOMIAL AND MULTIPLE REGRESSIONS 

35. TESTING LINEARITY OF REGRESSIONS 

THE most common type of regression calculated in statistical
analysis is the straight line having the formula $Y = a + b(x-\bar{x})$.
Other cases in which the relation of Y to x is better
represented by a line of higher order are, however, not infrequent.
The equation may be quadratic, $Y = a + b_1(x-\bar{x}) + b_2(x-\bar{x})^2$, cubic,
$Y = a + b_1(x-\bar{x}) + b_2(x-\bar{x})^2 + b_3(x-\bar{x})^3$, or indeed of any power of x.
Fitting such polynomial regressions clearly involves more
computational labour than a straight regression line, and this
naturally would not be undertaken unless necessary. So it is
desirable to have a method of testing the adequacy of straight
regression lines for the description of the relations existing
between x and y. A rigorous test of this point is not always
possible, but where replicated experiments, or their equivalents,
are used as the source of data the fit given by a straight-line
regression can be tested by a simple adaptation of the analysis
of variance.

Example 15. Table 31 gives the results of an experiment
conducted by Ashby to find the growth rate of a certain variety
of tomato. The plants were grown, together with certain other
varieties, in four replications or blocks. Two samples were taken
from each block at the ages of 10 days, 17 days, 24 days, 31 days,
38 days and 45 days. The dry weights of the samples were
determined. Since the growth curve was expected to be of the
exponential form $y = ae^{bx}$, where y is the dry weight and x the time, the
logarithm of the dry weight was used as the variate in order
that a straight-line relationship between x and y should be
calculated (see Section 31). Do the results justify the
expectation that the regression of the logarithm of dry weight on time
is linear?

There are 48 observations in all, so the analysis of variance
of y, the logarithm of the dry weight, will include 47 degrees of
freedom. Of these, 3 are dependent on block differences, 5 on
differences between the samples taken at different times, and
15 on the interaction of blocks and times. The remaining 24
are not further separable as they are dependent on differences
between the two samples taken each time from each block. The
analysis is, in fact, incomplete (Section 26).

TABLE 31

The Growth of Tomato Plants in Units of the Logarithm of Dry Weight (Ashby)

[Table body illegible in this copy.]

The sum of all the observations, S(y), is 112.898 and so
the correction term used in determining the sum of squares is
$\dfrac{112.898^2}{48}$, i.e. 265.540800.
$S(y^2) = 311.817918$, so giving $S(y-\bar{y})^2 = 46.277118$. The block totals
each include 12 observations, and so the sum of squares for
block differences is

$$\tfrac{1}{12}(28.489^2 + 28.005^2 + 28.863^2 + 27.541^2) - \frac{112.898^2}{48} = 0.082750$$

Similarly, as 8 samples are taken on each of the six occasions,
the sum of squares for sampling times is

$$\tfrac{1}{8}(5.937^2 + 12.119^2 + \ldots + 27.598^2) - \frac{112.898^2}{48}, \text{ i.e. } 45.618191$$

To find the item for interaction between blocks and times, the
two samples from each block are added together, so giving
a 6x4 table. The entries of this table are squared and the
squares summed and divided by 2, after which the deduction of
the correction term leaves the sums of squares for the 23 degrees
of freedom which include the items for blocks, times and their
interaction. The first two items have already been found and
may be deducted from this total to leave the interaction sum of
squares. Numerically the calculation is

$$\tfrac{1}{2}(1.511^2 + 3.011^2 + \ldots + 7.079^2 + 1.517^2 + \ldots + 6.793^2) - 265.540800 - 0.082750 - 45.618191 = 0.087938$$

Finally, the sum of squares between like samples, which may
be used to supply the error variance, is found as the remainder
after deducting the block, time and interaction items, already
calculated, from the total sum of squares.

Before proceeding further with the analysis of variance, which
is set out in Table 32, it should be observed that though the
‘times’ mean square is highly significant, the other mean squares,
for blocks and interaction, are subnormal when compared with
the error. Indeed, the interaction is significantly below the
value of the error mean square as it gives a variance ratio of
3.4697 with $N_1 = 24$ and $N_2 = 15$. The probability of such a result
is just less than 0.01. There is no obvious reason why this
should be so, but in view of the low probability, the interaction
item will not be pooled with error to give a new error mean
square based on 39 degrees of freedom, as would have been
legitimate if no significant difference existed.

Of the 5 degrees of freedom for time difference, 1 can be
used for the calculation of a linear regression relating log dry
weight (y) to time of sampling (x). The six sample times are
separated by a constant interval of 7 days, so the labour of
calculation can be reduced by assigning the x values 0, 1, 2, 3, 4
and 5 to the sampling times 10, 17, 24, 31, 38 and 45 days
respectively. Now 8 samples were taken on each occasion,
viz. two from each of four blocks, so that each x value is used
8 times. Hence

$$S(x) = 8(0+1+2+3+4+5) = 120.0$$

and

$$S(x^2) = 8(0^2+1^2+2^2+3^2+4^2+5^2) = 440.0$$

Then

$$S(x-\bar{x})^2 = 440 - \frac{120^2}{48} = 140$$

$$S(xy) = (0 \times 5.937)+(1 \times 12.119)+(2 \times 17.769)+(3 \times 23.023)+(4 \times 26.452)+(5 \times 27.598) = 360.524$$

and the correction term necessary to reduce this to $S[y(x-\bar{x})]$ is

$$\frac{120 \times 112.898}{48} = 282.245$$

Hence

$$S[y(x-\bar{x})] = 360.524 - 282.245 = 78.279$$

This gives $b = \dfrac{78.279}{140.0} = 0.559136$, i.e. the log dry weight of each
sample increases, on the average, by 0.559136 over each period
of seven days. The sum of squares of y for which this
regression accounts is $\dfrac{78.279^2}{140.0}$, i.e. 43.768585. The analysis of variance
now takes its final form of Table 32. The linear regression is
highly significant, but we are more interested in the mean square
for ‘times’ remaining after the regression item has been
subtracted. This is $\dfrac{45.618191 - 43.768585}{4} = 0.462402$, which when divided by the
error mean square gives a variance ratio of 22.730 for 4 and
24 degrees of freedom. The probability of a fit as bad or worse
with the hypothesis of linear regression is shown to be very
small, and it must be concluded that the straight line is
inadequate for describing the relation of dry weight to time of sampling.
A regression of higher order must be used.
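The test of linearity can be followed through from the six time totals and the sums of squares quoted in the text. The Python sketch below is a check on the arithmetic only; the variable names are illustrative.

```python
# Totals of the 8 samples taken at each time (x coded 0..5), as quoted in the text.
time_totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]
xs = [0, 1, 2, 3, 4, 5]
per_time = 8
n_obs = per_time * len(xs)                        # 48 observations

grand = sum(time_totals)                          # S(y) = 112.898
correction = grand ** 2 / n_obs

# Sum of squares for sampling times, 5 degrees of freedom.
ss_times = sum(t * t for t in time_totals) / per_time - correction

# Linear regression item, 1 degree of freedom.
sx = per_time * sum(xs)
sxx = per_time * sum(x * x for x in xs) - sx ** 2 / n_obs
sxy = sum(x * t for x, t in zip(xs, time_totals)) - sx * grand / n_obs
b = sxy / sxx
ss_linear = sxy ** 2 / sxx

# Error sum of squares (24 degrees of freedom) from the items quoted in the text.
ss_total, ss_blocks, ss_interaction = 46.277118, 0.082750, 0.087938
error_ms = (ss_total - ss_blocks - ss_times - ss_interaction) / 24

# Mean square for 'times' after removal of the linear item, compared with error.
deviation_ms = (ss_times - ss_linear) / 4
variance_ratio = deviation_ms / error_ms

print(round(b, 6), round(ss_linear, 6), round(deviation_ms, 6), round(variance_ratio, 3))
```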


36. THE CHOICE OF ORDER OF A POLYNOMIAL 

Having concluded that a regression of order higher than the
first is needed, it is next necessary to decide just how high the
order must be.

If n points are available to show the relation between x and
y, a curve of order n-1 can be found to give a perfect fit with
these points. Thus two points can be joined by a straight line
which is, of course, a curve of the first order, and has the general
formula $Y = a + b_1(x-\bar{x})$. With three observations a quadratic
curve, $Y = a + b_1(x-\bar{x}) + b_2(x-\bar{x})^2$, will pass through all points. With
four a cubic, $Y = a + b_1(x-\bar{x}) + b_2(x-\bar{x})^2 + b_3(x-\bar{x})^3$, is necessary,
and so on.

It will be seen that the order of the curve necessary to give
a perfect fit in all cases is the same as the number of degrees
of freedom available for differences between the observations.
We have seen that the calculation of a first order or linear
regression can be used to take out a sum of squares of y, the
dependent variate, for 1 degree of freedom. So it might be
expected that the calculation of a second order or quadratic
regression term would take out an additional sum of squares for
a second degree of freedom, and so on. The calculation of n-1
regression coefficients would then remove n-1 items from the
sum of squares, each for 1 degree of freedom. In this way the
degrees of freedom would all be accounted for; but if the
process is equally to account for the whole of the sum of squares,
each of the n-1 items must be orthogonal to the rest. When
this criterion is not satisfied the various items cannot be
incorporated in the same analysis of variance. Now orthogonality is
not obtained if regression items of the form $b_1(x-\bar{x})$, $b_2(x-\bar{x})^2$,
$b_3(x-\bar{x})^3$ are used, but this difficulty can be overcome by
a modification of procedure.

It has already been shown that in Ashby’s experiment the
first-order regression coefficient, b, is 0.559136 and the mean of
all the log dry weights is 2.352042. So the first-order regression
equation is:

$$Y = 2.352042 + 0.559136(x-\bar{x})$$

where $\bar{x} = 2.5$. Now the two portions of this equation take out
independent sums of squares in the partition of $S(y^2)$ as discussed
in Section 32. But we can rewrite the formula as

$$Y = 2.352042 + 0.559136(x - 2.5)$$
$$= 2.352042 - (2.5 \times 0.559136) + 0.559136x$$
$$= 0.954202 + 0.559136x$$

In this last equation the two parts do not take out independent
items in the partition of $S(y^2)$. Yet it is only another form of
the first equation in which the two portions were orthogonal.
The difference between the two lies in the fact that in the latter
equation the second term is concerned solely with x, while in
the former version, which represents the usual type of linear
regression equation, the second term includes items based on
both x and $\bar{x}$, the latter being a constant for any set of
data.

In the same way a second-order regression coefficient, $b_2$,
need not be concerned solely with $x^2$, but can control any term
of the form $(\alpha + \beta x + x^2)$ and the values of $\alpha$ and $\beta$ can be chosen
to make this term orthogonal to any other term in the regression
equation. So instead of calculating the coefficients $b_1$, $b_2$, &c.,
in the equation

$$Y = a + b_1(x-\bar{x}) + b_2(x-\bar{x})^2 + b_3(x-\bar{x})^3 \ldots$$

we find $b_1'$, $b_2'$, $b_3'$, &c., in

$$Y = a + b_1'\xi_1 + b_2'\xi_2 + b_3'\xi_3 \ldots$$

where $\xi_1$ is of the form $\alpha_1 + x$
      $\xi_2$ is of the form $\alpha_2 + \beta_2 x + x^2$
      $\xi_3$ is of the form $\alpha_3 + \beta_3 x + \gamma_3 x^2 + x^3$, &c.

and $\xi_1$, $\xi_2$ and $\xi_3$ are chosen to be orthogonal to one another.
Each successive regression coefficient will then represent an
independent sum of squares in the analysis of variance of y.

In this way it is possible to find, first, the sum of squares
removed by a linear regression involving $\xi_1$, second, the additional
sum of squares accounted for by the introduction of the quadratic
item $\xi_2$, third, the item removed by $\xi_3$, which is cubic in x, and
so on. The sum of squares between n observations is thus
partitioned into n-1 items, each for 1 degree of freedom and each
of which corresponds to a regression component of characteristic
order in x. When this is done the choice of order of a polynomial
adequately representing the relation of y to x is made easy.
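The partition into orthogonal items can be illustrated numerically. In the Python sketch below the values of the successive functions are found by Gram-Schmidt orthogonalisation of the powers of x, which is a substitute of our own for the algebraic determination of the constants but yields functions of the same form, each having unit coefficient for its highest power of x. The function names are illustrative, and the six time totals of Ashby's experiment are taken from the text.

```python
from fractions import Fraction

def orthogonal_values(xs, max_order):
    """Values at xs of xi_1 .. xi_max_order: each is x**order less multiples of
    the lower-order functions, chosen to be orthogonal to all of them."""
    xs = [Fraction(x) for x in xs]
    basis = [[Fraction(1)] * len(xs)]                 # the constant term
    for order in range(1, max_order + 1):
        col = [x ** order for x in xs]
        for prev in basis:
            coef = (sum(c * p for c, p in zip(col, prev))
                    / sum(p * p for p in prev))
            col = [c - coef * p for c, p in zip(col, prev)]
        basis.append(col)
    return basis[1:]

# Three points at x = 0, 1, 2: linear and quadratic functions.
xi_three = orthogonal_values([0, 1, 2], 2)

# Ashby's six time totals (8 samples each), x coded 0..5.
time_totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]
per_time = 8
components = []
for k in orthogonal_values(range(6), 5):
    k = [float(v) for v in k]
    contrast = sum(kv * t for kv, t in zip(k, time_totals))
    components.append(contrast ** 2 / (per_time * sum(kv * kv for kv in k)))

# One item for each of the 5 degrees of freedom between times.
print([round(c, 6) for c in components], round(sum(components), 6))
```

The first item is the linear regression sum of squares already found, and the five items together make up the whole sum of squares for sampling times.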

In order to specify the exact relation between the various $\xi$'s
and $x$ it is necessary to know the values of $\alpha_1$, $\alpha_2$, $\beta_2$,
$\alpha_3$, $\beta_3$, $\gamma_3$, &c. These can be found by a consideration of
the properties of orthogonal functions as developed in Section 23.
First let us take the case of three double observations $(x_1 y_1)$,
$(x_2 y_2)$ and $(x_3 y_3)$, the $x$ values being 0, 1 and 2. There are
2 degrees of freedom between the three values of $y$, and so a
quadratic regression of $y$ on $x$ will give a perfect fit with all three
points on the graph relating $y$ to $x$. The complete analysis will thus
involve two $\xi$ functions. The first, $\xi_1$, will be linear with respect
to $x$ and the second, $\xi_2$, will be quadratic in $x$.

In the analysis of variance of $y$ each of these functions will
take out a sum of squares of

$$\frac{1}{S(k^2)}(k_1 y_1 + k_2 y_2 + k_3 y_3)^2$$

where $k_1$ is the value of $\xi$ when $x_1$ is substituted for $x$ in the
general formula, $k_2$ the value of $\xi$ when $x_2$ is substituted for $x$,
and $k_3$ the value of $\xi$ when $x_3$ is substituted for $x$. A comparison
of this formula for the sum of squares removed by each $\xi$ function
with the formula for the sum of squares removed by a simple
regression coefficient shows that $(x-\bar{x})$ in the latter is replaced
by the various $k$ values in the former. The formula is, in fact,
merely re-written in the form with which we have become familiar
from the discussion of the analysis of variance.

Now if the various sums of squares are to be independent,
two conditions must be satisfied. These are that $S(k)=0$ for each
function and $S(kk')=0$ for each pair of functions. In our example
$x_1$, $x_2$ and $x_3$ are 0, 1 and 2 respectively, and so
$k_{11}=\alpha_1+x_1$, $k_{12}=\alpha_1+x_2$ and $k_{13}=\alpha_1+x_3$.

Then $S(k_1)=3\alpha_1+S(x)=0$. Hence $\alpha_1=-\frac{S(x)}{3}=-\bar{x}$.
So $\xi_1=x-\bar{x}$ and $k_{11}=-1$, $k_{12}=0$ and $k_{13}=1$ as $\bar{x}=1$.

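The second function $\xi_2$ is quadratic with respect to $x$ and has
the general form $\xi_2=\alpha_2+\beta_2 x+x^2$. Substituting the three
specific values for $x$ in this formula we have

$$k_{21}=\alpha_2+\beta_2 x_1+x_1^2, \quad k_{22}=\alpha_2+\beta_2 x_2+x_2^2, \quad k_{23}=\alpha_2+\beta_2 x_3+x_3^2$$

and

$$S(k_2)=3\alpha_2+\beta_2 S(x)+S(x^2)=0$$

But we can obtain a second equation from the fact that $S(k_1 k_2)$
must also equal 0. We already know that

$$k_{11}=x_1-\bar{x}, \quad k_{12}=x_2-\bar{x} \quad \text{and} \quad k_{13}=x_3-\bar{x}$$

Then

$$S(k_1 k_2)=-3\bar{x}\alpha_2+(\alpha_2-\beta_2\bar{x})S(x)+(\beta_2-\bar{x})S(x^2)+S(x^3)=0$$

These equations may be simplified by substituting 0, 1 and 2
for $x_1$, $x_2$ and $x_3$. $S(x)=3$, $S(x^2)=5$ and $S(x^3)=9$, with $\bar{x}=1$. Hence

$$S(k_2)=3\alpha_2+3\beta_2+5=0 \qquad \text{(i)}$$

$$S(k_1 k_2)=-3\alpha_2+3(\alpha_2-\beta_2)+5(\beta_2-1)+9=2\beta_2+4=0 \qquad \text{(ii)}$$

From (ii) $\beta_2=-\frac{4}{2}=-2$

Substituting in (i) $3\alpha_2-6+5=0$

and $\alpha_2=\frac{1}{3}$

So $\xi_2=\frac{1}{3}-2x+x^2$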

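and substitution for $x$ in this general formula gives

$$k_{21}=\tfrac{1}{3}-2x_1+x_1^2=\tfrac{1}{3}-0+0=\tfrac{1}{3}$$

$$k_{22}=\tfrac{1}{3}-2x_2+x_2^2=\tfrac{1}{3}-2+1=-\tfrac{2}{3}$$

$$k_{23}=\tfrac{1}{3}-2x_3+x_3^2=\tfrac{1}{3}-4+4=\tfrac{1}{3}$$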
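The two sets of $k$ values for the three-point case can be checked numerically. The sketch below is not the book's procedure: it obtains the same monic orthogonal functions by Gram-Schmidt orthogonalisation of the powers of $x$, which is equivalent to imposing $S(k)=0$ and $S(kk')=0$. The function name `orthogonal_values` is an illustrative assumption.

```python
from fractions import Fraction


def orthogonal_values(xs, degree):
    """Values at the points xs of the monic orthogonal functions xi_1..xi_degree.

    Each power x**r is made orthogonal to every lower-order function
    (including the constant), so S(k) = 0 and S(k k') = 0 hold exactly.
    """
    n = len(xs)
    basis = [[Fraction(1)] * n]  # xi_0 = 1, the constant term
    for r in range(1, degree + 1):
        v = [Fraction(x) ** r for x in xs]
        for b in basis:
            coef = sum(vi * bi for vi, bi in zip(v, b)) / sum(bi * bi for bi in b)
            v = [vi - coef * bi for vi, bi in zip(v, b)]
        basis.append(v)
    return basis[1:]


k1, k2 = orthogonal_values([0, 1, 2], 2)
print(k1)  # linear k values
print(k2)  # quadratic k values
```

Exact fractions are used so that the thirds appearing in $k_{21}$, $k_{22}$ and $k_{23}$ are reproduced without rounding.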

We now have the two orthogonal functions which will partition
the sum of squares of $y$ into two parts dependent respectively
for their values on the first order, or linear, and on the
second order, or quadratic, regressions of $y$ on $x$, as determined
from the three double observations $(x_1 y_1)$, $(x_2 y_2)$, $(x_3 y_3)$ when
$x_1=0$, $x_2=1$ and $x_3=2$. It may be noted, too, that as the second
component of the sum of squares, viz.

$$\tfrac{3}{2}\left(\tfrac{1}{3}y_1-\tfrac{2}{3}y_2+\tfrac{1}{3}y_3\right)^2$$

can be re-written as

$$\tfrac{1}{6}(y_1-2y_2+y_3)^2$$

the $k$ values can be made integral * thus

$$k_{21}=1, \quad k_{22}=-2, \quad k_{23}=1$$

the general formula becoming $\xi_2=1-6x+3x^2$. A glance back to
Section 29 will show that these $\xi$'s are, in fact, the two functions
already used in the analysis of the effects of phosphatic
manure when applied at the three levels 0, 1 and 2.

One further point should be noticed about these $\xi$ functions.
$\xi_1$ is already in the form $(x-\bar{x})$, but $\xi_2$ has been found as
$\frac{1}{3}-2x+x^2$. We know that $\bar{x}=1$ and so

$$\xi_2=\tfrac{1}{3}-2x+x^2=\tfrac{1}{3}-2x\bar{x}+x^2=(x^2-2x\bar{x}+\bar{x}^2)-\tfrac{2}{3}=(x-\bar{x})^2-\tfrac{2}{3}$$

Thus both $\xi$'s can be written in terms of the deviation of $x$ from
its mean. So, provided that $(x_2-x_1)=(x_3-x_2)=1$, the same $k$'s
will be obtained no matter what the absolute values of $x_1$, $x_2$
and $x_3$ may be.

Ashby's data can be treated in exactly the same way as this
relatively simple case of three observations. Six sampling times
were used, and these have already been denoted as $x_1$, $x_2$, ... $x_6$,
where $x_1=0$, $x_2=1$, $x_3=2$, &c. Now $\xi_1=\alpha_1+x$, and so
$k_{11}=\alpha_1+x_1$, &c. Hence $S(k_1)=6\alpha_1+S(x)=0$ and $\alpha_1=-\bar{x}$.
In this case $\bar{x}=2\frac{1}{2}$ and so the $k$ values are $-2\frac{1}{2}$,
$-1\frac{1}{2}$, $-\frac{1}{2}$, $\frac{1}{2}$, $1\frac{1}{2}$ and $2\frac{1}{2}$. These
coefficients may be multiplied by 2 to make them integral,
whereupon they become -5, -3, -1, 1, 3 and 5. Then

$$S(k_1^2)=(25+9+1+1+9+25)=70$$

and the sum of squares accounted for by linear regression is

$$\tfrac{1}{70}(-5y_1-3y_2-y_3+y_4+3y_5+5y_6)^2$$

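Turning to the quadratic regression $\xi_2=\alpha_2+\beta_2 x+x^2$
and $k_{21}=\alpha_2+\beta_2 x_1+x_1^2$, &c.

Hence $S(k_2)=6\alpha_2+\beta_2 S(x)+S(x^2)$

and $S(k_1 k_2)=-6\alpha_2\bar{x}+(\alpha_2-\beta_2\bar{x})S(x)+(\beta_2-\bar{x})S(x^2)+S(x^3)=0$

Now $S(x)=15$, $S(x^2)=55$ and $S(x^3)=225$, with $\bar{x}=2\frac{1}{2}$

So $S(k_2)=6\alpha_2+15\beta_2+55=0$

and $S(k_1 k_2)=-15\alpha_2+15(\alpha_2-2\tfrac{1}{2}\beta_2)+55(\beta_2-2\tfrac{1}{2})+225=0$

The solution of these equations is easily found to be

$\alpha_2=3\frac{1}{3}$ and $\beta_2=-5$

so making $\xi_2=3\frac{1}{3}-5x+x^2$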


* Integral in the sense that each value is a whole number and does
not include any fractions.




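The $k$ values are then obtained by substituting the various
numerical values for $x$ in this formula and are

$$k_{21}=3\tfrac{1}{3}, \; k_{22}=-\tfrac{2}{3}, \; k_{23}=-2\tfrac{2}{3}, \; k_{24}=-2\tfrac{2}{3}, \; k_{25}=-\tfrac{2}{3}, \; k_{26}=3\tfrac{1}{3}$$

These coefficients require multiplication by $\frac{3}{2}$ to make them
integral in the form

$$k_{21}=5, \; k_{22}=-1, \; k_{23}=-4, \; k_{24}=-4, \; k_{25}=-1, \; k_{26}=5$$

It is then seen that $S(k_2^2)=84$ and the sum of squares of $y$
accounted for by the quadratic regression item is

$$\tfrac{1}{84}(5y_1-y_2-4y_3-4y_4-y_5+5y_6)^2$$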

The expression in $x$ of $\xi_3$, the cubic regression, and of the
corresponding $k$ coefficients can be found in the same way. It will
be of the general form

$$\xi_3=\alpha_3+\beta_3 x+\gamma_3 x^2+x^3$$

and the three simultaneous equations necessary for evaluating
the constants will be derived from the relations

$$S(k_3)=S(k_1 k_3)=S(k_2 k_3)=0$$

If the operation is carried through it will be found that

$$\xi_3=-3+13.7x-7.5x^2+x^3$$
$$k_{31}=-3+0-0+0=-3$$
$$k_{32}=-3+13.7-7.5+1=4.2$$
$$k_{33}=-3+27.4-30+8=2.4$$
$$k_{34}=-3+41.1-67.5+27=-2.4$$
$$k_{35}=-3+54.8-120+64=-4.2$$
$$k_{36}=-3+68.5-187.5+125=3$$

If these are all multiplied by $\frac{5}{3}$ they become -5, 7, 4, -4, -7
and 5 respectively and $S(k_3^2)=180$. The sum of squares accounted
for by cubic regression is thus

$$\tfrac{1}{180}(-5y_1+7y_2+4y_3-4y_4-7y_5+5y_6)^2$$
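The three simultaneous equations for the cubic can be set up and solved mechanically. The following is a minimal sketch, assuming the $\xi_1$ and $\xi_2$ already found for six equally spaced observations; the helper `solve` is an illustrative Gauss-Jordan routine working in exact fractions, not anything prescribed by the text.

```python
from fractions import Fraction


def solve(a, b):
    """Solve the square system a z = b exactly by Gauss-Jordan elimination."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if m[r][col] != 0)
        m[col], m[pivot] = m[pivot], m[col]
        m[col] = [v / m[col][col] for v in m[col]]
        for r in range(n):
            if r != col:
                factor = m[r][col]
                m[r] = [vr - factor * vc for vr, vc in zip(m[r], m[col])]
    return [row[n] for row in m]


xs = [Fraction(x) for x in range(6)]
xbar = sum(xs) / 6

# One weight vector per condition: S(k3) = 0, S(k1 k3) = 0, S(k2 k3) = 0.
weights = [
    [Fraction(1)] * 6,
    [x - xbar for x in xs],                         # xi_1 = x - xbar
    [Fraction(10, 3) - 5 * x + x * x for x in xs],  # xi_2 = 3 1/3 - 5x + x^2
]

# Each condition S(w * (alpha + beta x + gamma x^2 + x^3)) = 0 is linear
# in alpha, beta and gamma.
a = [[sum(w * x ** p for w, x in zip(ws, xs)) for p in range(3)] for ws in weights]
b = [-sum(w * x ** 3 for w, x in zip(ws, xs)) for ws in weights]
alpha3, beta3, gamma3 = solve(a, b)

k3 = [alpha3 + beta3 * x + gamma3 * x ** 2 + x ** 3 for x in xs]
k3_integral = [k * Fraction(5, 3) for k in k3]
print(alpha3, beta3, gamma3)
print(k3_integral)
```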

The calculation of $\xi_3$ in this way is, however, rather laborious,
and the calculations of $\xi_4$ and $\xi_5$ will be even more so. An
alternative method is offered by the use of the recurrence formula

$$\xi_{r+1}=\xi_1\xi_r-\frac{r^2(n^2-r^2)}{4(4r^2-1)}\xi_{r-1}$$

where $n$ is the number of double observations, in this case 6.
This formula can only be used to give the original $k$ coefficients,
i.e. it will not apply after the $k$'s have been multiplied through
by some constant to make them integral. Thus we must use
$\xi_1=x-\bar{x}$ and $\xi_2=3\frac{1}{3}-5x+x^2$ in order to derive $\xi_3$. If
$\xi_2=5-\frac{15}{2}x+\frac{3}{2}x^2$ were employed a wrong answer would be
obtained.



Let us consider the derivation of $k_{31}$, knowing that $k_{11}$ is
$-2\frac{1}{2}$ and $k_{21}$ is $3\frac{1}{3}$. In this case $n=6$ and $r+1=3$. Hence

$$k_{31}=k_{11}k_{21}-\frac{2^2(6^2-2^2)}{4(4\times 2^2-1)}k_{11}$$

which on substituting for $k_{11}$ and $k_{21}$ gives

$$k_{31}=(-2\tfrac{1}{2}\times 3\tfrac{1}{3})-\frac{4(36-4)}{4(16-1)}(-2\tfrac{1}{2})=-\frac{50}{6}+\frac{320}{60}=-3$$

as already found by the other method.

Similarly $k_{32}$, $k_{33}$, $k_{34}$, &c., can be found in the same way.

Next, turning to the calculation of the coefficients necessary
for finding the sum of squares due to the quartic regression,
$r+1=4$ and so

$$k_{41}=k_{11}k_{31}-\frac{9(36-9)}{4(36-1)}k_{21}=[(-2\tfrac{1}{2})\times(-3)]-[\tfrac{243}{140}\times 3\tfrac{1}{3}]$$

$$=\frac{1,050}{140}-\frac{810}{140}=\frac{12}{7}$$

and $k_{42}$, &c., can be derived similarly.
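The recurrence lends itself to direct computation. The sketch below applies it to $n=6$ with exact fractions and then reduces each row to whole numbers, which yields the multipliers at the same time. The function names are illustrative assumptions, and `math.lcm` and multi-argument `math.gcd` require Python 3.9 or later.

```python
from fractions import Fraction
from math import gcd, lcm  # multi-argument forms need Python 3.9+


def recurrence_values(n, max_order):
    """Non-integral k values of xi_1..xi_max_order for x = 0, 1, ..., n-1."""
    xbar = Fraction(n - 1, 2)
    xi = [[Fraction(1)] * n, [x - xbar for x in range(n)]]  # xi_0 and xi_1
    for r in range(1, max_order):
        c = Fraction(r * r * (n * n - r * r), 4 * (4 * r * r - 1))
        xi.append([a * b - c * p for a, b, p in zip(xi[1], xi[r], xi[r - 1])])
    return xi[1:]


def make_integral(ks):
    """Return (multiplier, whole-number k values) for one row of k's."""
    scale = lcm(*(k.denominator for k in ks))
    whole = [int(k * scale) for k in ks]
    common = gcd(*whole)
    return Fraction(scale, common), [w // common for w in whole]


rows = recurrence_values(6, 5)
multipliers, table = zip(*(make_integral(row) for row in rows))
sums_of_squares = [sum(k * k for k in row) for row in table]

for order, (mult, row, ss) in enumerate(zip(multipliers, table, sums_of_squares), 1):
    print(order, row, ss, mult)
```

The recurrence is applied to the non-integral values throughout; the reduction to whole numbers is done only at the end, in line with the warning given in the text.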

This process is repeated until all the $k$'s of $\xi_3$, $\xi_4$ and $\xi_5$ have
been found. The integral $k$ values corresponding to those found
by the use of the recurrence formula, which are, of course,
non-integral, are given in Table 33, together with the multipliers
used to make them integral and the values of $S(k^2)$ obtained
from the integral values. The use of the multipliers can be
seen from an example. $k_{41}$ is given in the table as 1, and the
multiplier appropriate to $\xi_4$ is given as $\frac{7}{12}$. Then the
non-integral value originally found for $k_{41}$ was $\frac{12}{7}$, which gave 1
on multiplication by $\frac{7}{12}$. It might be noted that the multiplier
is also the coefficient of the highest power of $x$ in the general
form of the $\xi$ function to which it belongs. Thus the multiplier
given for $\xi_2$ is $\frac{3}{2}$, and so we know that the coefficient of $x^2$, the
highest power of $x$, in that form of $\xi_2$ which gives the values
of $k$ shown in the table, is $\frac{3}{2}$. It will be remembered that the
full formula has been established as $\xi_2=5-\frac{15}{2}x+\frac{3}{2}x^2$.

The material necessary for the analysis of variance of $y$ in
relation to its polynomial regression on $x$ has now been found
for the case of six observations in which $x$ takes the values
0, 1, 2 ... 5. It may be mentioned that similar sets of
coefficients are given by Fisher and Yates for the analysis of
all cases up to $n=52$, though they confine their attention to the
first five $\xi$ functions for the cases where $n=6$ or more.


TABLE 33

Orthogonal Polynomial for the Analysis of Six Double Observations

             x =  0     1     2     3     4     5    Sum of        Multiplication
                                                     squares of k  factor

$\xi_1$          -5    -3    -1     1     3     5        70          2
$\xi_2$           5    -1    -4    -4    -1     5        84          3/2
$\xi_3$          -5     7     4    -4    -7     5       180          5/3
$\xi_4$           1    -3     2     2    -3     1        28          7/12
$\xi_5$          -1     5   -10    10    -5     1       252          21/10


In Ashby's experiment, eight samples were taken at each
time and so each $x$ corresponds to eight $y$ values. Thus each
$k$ will be used eight times, once on each of the eight values
corresponding to the same $x$. So the $S(k^2)$ used in finding any
sum of squares will be eight times the value given opposite the
appropriate $k$ values in Table 33. The arithmetic is simplified
by remembering that the sum of the cross-products of $k$ and
the eight separate $y$'s is the same as the cross-product of $k$
and the sum of the eight $y$ values. Thus in the calculation of
the linear regression sum of squares $k_{11}=-5$. This corresponds
to the eight observations 0.748, 0.763, 0.845, 0.672, 0.881, 0.785,
0.580 and 0.663, of which the sum is 5.937 (Table 31). So instead
of performing eight separate multiplications and afterwards
summing the products, we merely multiply 5.937 by -5. The
sums of $y$, corresponding to each of the six sampling times,
are given in Table 31 and these are multiplied by the $k$'s of
Table 33. The sum of squares for linear regression so obtained is

$$\tfrac{1}{560}[(-5\times 5.937)+(-3\times 12.119)+(-1\times 17.769)+(1\times 23.023)+(3\times 26.452)+(5\times 27.598)]^2=\frac{156.558^2}{560}=43.768585$$

as obtained in the previous section by a different method.

The sum of squares for quadratic regression is found similarly,
but using $k_2$, to be

$$\tfrac{1}{672}[(5\times 5.937)+(-1\times 12.119)+(-4\times 17.769)+(-4\times 23.023)+(-1\times 26.452)+(5\times 27.598)]^2=\frac{(-34.064)^2}{672}=1.726720$$

The remainder of the five orthogonal items are calculated in
just the same way and the results are shown in Table 34. Their
sum is 45.618191, as found in the original analysis of variance
of Table 32. This provides an arithmetic check of the success
of the partition.
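The whole partition can be reproduced from the six totals quoted above and the $k$ values of Table 33. This is a sketch only; the variable and function names are illustrative, and the totals are those given in the text as coming from Table 31.

```python
# Totals of the eight y values at each of the six sampling times (Table 31).
totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]
reps = 8  # observations per sampling time

# Integral k values of Table 33.
k_table = {
    1: [-5, -3, -1, 1, 3, 5],
    2: [5, -1, -4, -4, -1, 5],
    3: [-5, 7, 4, -4, -7, 5],
    4: [1, -3, 2, 2, -3, 1],
    5: [-1, 5, -10, 10, -5, 1],
}


def item_sum_of_squares(ks, totals, reps):
    """Sum of squares for one orthogonal item: S(k * total)^2 / (reps * S(k^2))."""
    cross = sum(k * t for k, t in zip(ks, totals))
    return cross ** 2 / (reps * sum(k * k for k in ks))


items = {r: item_sum_of_squares(ks, totals, reps) for r, ks in k_table.items()}

# Sum of squares between times, found directly as a check on the partition.
between_times = (sum(t * t for t in totals) / reps
                 - sum(totals) ** 2 / (reps * len(totals)))

for order, ss in items.items():
    print(order, round(ss, 6))
print("sum of items:", round(sum(items.values()), 6))
print("between times:", round(between_times, 6))
```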





These five items each correspond to 1 degree of freedom, and
so may be compared with the error variance of the experiment,
as given in Table 32, by means of a $t$ test. The test of the
linear item has already been made in the previous section. The
quadratic item gives $t=\sqrt{1.726720/V_E}=9.213$, where $V_E$ is the
error variance of Table 32, with a very small probability. There
is a significant quadratic term in the regression of $y$ on $x$. The
cubic item is suggestively large, with a probability of less than
0.05, and so will be worth including in the regression analysis,
but the mean squares due to the quartic and quintic terms are
subnormal. Hence these items may be omitted from the regression
equation. This could actually have been shown to be the case
without calculating the quartic and quintic items separately, as
the sum of squares remaining when the linear, quadratic and cubic
items have been isolated is subnormal. In either case it is clear
that a cubic regression equation will adequately represent the
relation of $y$ to $x$.

In this example it has been possible to use an error variance
provided by the replication of the experiment, but even if this
mean square had not been available it would still have been
possible to make a partial analysis of the regression. The sum
of squares remaining after deduction of the items under
consideration would then have formed the estimate of error. Thus
linear regression removes from the 'times' item a sum of squares
of 43.768585 for 1 degree of freedom, leaving a remainder of
1.849606 for 4 degrees of freedom. This gives a mean square
of 0.462402, against which the linear regression mean square is
highly significant. The next deduction is 1.726720 for the
quadratic regression, which leaves a remainder of 0.122886 for
3 degrees of freedom. The remainder mean square is thus
0.040962 and the quadratic item is again significant. This
process is continued, but it does not provide a very good test of
the cubic and quartic items as the number of degrees of freedom
in the remainder is getting very small. There is, of course, no
test at all of the quintic regression term since no error mean
square remains. It will be seen then that the provision of an
error variance by replication is greatly to be desired, even though
its omission may not be fatal to the analysis.
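The successive deductions just described follow a simple pattern, sketched below with the figures quoted in the text. The variance ratio printed in the last column is an addition for illustration; the text itself reports only the remainders and their mean squares.

```python
between_times = 45.618191  # 5 degrees of freedom
items = [("linear", 43.768585), ("quadratic", 1.726720)]

remainder, df = between_times, 5
rows = []
for name, ss in items:
    remainder -= ss            # sum of squares left after removing this item
    df -= 1                    # each item takes 1 degree of freedom
    mean_square = remainder / df
    rows.append((name, ss, remainder, df, mean_square, ss / mean_square))

for name, ss, rem, d, ms, ratio in rows:
    print(f"{name:10s} item={ss:.6f} remainder={rem:.6f} df={d} "
          f"mean square={ms:.6f} ratio={ratio:.1f}")
```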

37. THE CALCULATION OF A POLYNOMIAL REGRESSION 

It is now known that a cubic curve is necessary to express
$y$ in terms of $x$ in Ashby's experiment. The next and last part
of the analysis is that of finding the formula of this curve.

The calculation of the various coefficients in the polynomial
regression of the form

$$Y=a+b_1(x-\bar{x})+b_2(x-\bar{x})^2+b_3(x-\bar{x})^3$$

is a special case of the much wider question of finding the
regression of $y$ on two or more variates, which may themselves
be non-independent. The degree of dependence of these variates
must have an effect on the regression coefficients. If they are
independent of one another every variate is capable of supplying
further information for the prediction of $Y$, while if the
dependence is complete the second and all later variates supply
no information not already obtained from the first one. In such
a case the calculation of the regression of $y$ on any but the first
variate is a waste of time.

There is a general treatment of regression on a number of 
variates, which may not be mutually independent, but this will 
be reserved for discussion in the next section. In the case of 
polynomial regression there exists a special way of overcoming 
the difficulties, which depends on the use of the $\xi$ functions
already developed.

The difficulty of fitting the regression coefficients in the
polynomial

$$Y=a+b_1(x-\bar{x})+b_2(x-\bar{x})^2+b_3(x-\bar{x})^3$$

depends entirely on the fact that $(x-\bar{x})$, $(x-\bar{x})^2$, &c., are
correlated, i.e. not independent. If independent, the various
regression coefficients could be found without reference to one
another, just as $b$ is found in the case of a linear regression. But we
have already seen that the polynomial can, by use of the $\xi$
functions, be recast in a form where each term is independent of the
rest. Hence the use of these functions affords a way of
surmounting the difficulties of estimation, in exactly the same way
and for the same reason as it enabled the sum of squares of $y$
to be partitioned in the analysis of variance. Instead of
estimating $b_1$, $b_2$, &c., in the polynomial regression already given,
we estimate $b_1'$, $b_2'$, &c., in the equivalent

$$Y=a+b_1'\xi_1+b_2'\xi_2+b_3'\xi_3$$

When these have been found they can be used to reconstruct
the first polynomial equation, since the relations of $\xi_1$, $\xi_2$, &c.,
to $x$, $x^2$, &c., are known (see Table 33).

Each of the regression coefficients $b_1'$, $b_2'$, &c., can be
calculated independently of the rest, as $\xi_1$, $\xi_2$, &c., are mutually
independent. So we can put

$$b_1'=\frac{S(\xi_1 y)}{S(\xi_1^2)}, \quad b_2'=\frac{S(\xi_2 y)}{S(\xi_2^2)}, \quad \text{\&c.}$$

In every case $\bar{\xi}=0$, as $S(k)=0$, so that $\xi$ itself replaces the term
$(x-\bar{x})$ in the commoner type of regression equation. In the case
of Ashby's data it is necessary to remember that there are
eight $y$ observations corresponding to each time, $x$. So $S(\xi^2)$
will have eight times the value of the sum of squares given in
Table 33.

The arithmetic is very simple. From Tables 31 and 33

$$S(\xi_1 y)=S(k_1 y)=[(-5\times 5.937)+(-3\times 12.119)+(-1\times 17.769)+(1\times 23.023)+(3\times 26.452)+(5\times 27.598)]=156.558$$

and from Table 33

$$S(\xi_1^2)=8S(k_1^2)=8\times 70=560$$

Hence $b_1'=\frac{156.558}{560}=0.279568$. This regression coefficient is the
one appertaining to the linear regression of $y$ on $x$. But we
found in Section 35 that the linear regression coefficient was
0.559136, i.e. exactly double the value now reached. The reason
for this is not difficult to see. $\xi_1$ was originally found as $(x-\bar{x})$,
but it was multiplied by 2 in order to make the $k$ values integral.
Hence $\xi_1$ now is $2(x-\bar{x})$, and so the regression coefficient should
be half that found when $(x-\bar{x})$ is used as the independent variate.

$$b_2'=\frac{S(\xi_2 y)}{S(\xi_2^2)}=\frac{S(k_2 y)}{8S(k_2^2)}$$

$$=\frac{1}{84\times 8}[(5\times 5.937)+(-1\times 12.119)+(-4\times 17.769)+(-4\times 23.023)+(-1\times 26.452)+(5\times 27.598)]=\frac{-34.064}{672}=-0.050690$$

$$b_3'=\frac{S(\xi_3 y)}{S(\xi_3^2)}=\frac{S(k_3 y)}{8S(k_3^2)}$$

$$=\frac{1}{180\times 8}[(-5\times 5.937)+(7\times 12.119)+(4\times 17.769)+(-4\times 23.023)+(-7\times 26.452)+(5\times 27.598)]=\frac{-13.042}{1,440}=-0.009057$$

To complete the calculation we note that, as in the case of
simple linear regression, $a=\bar{y}=2.352042$.

Hence $Y=2.352042+0.279568\xi_1-0.050690\xi_2-0.009057\xi_3$.
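Because each $b'$ depends only on its own $\xi$, the three coefficients can be computed one at a time. The sketch below repeats the arithmetic above from the six totals; the names are illustrative.

```python
totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]  # Table 31
reps = 8

k_table = [
    [-5, -3, -1, 1, 3, 5],    # xi_1
    [5, -1, -4, -4, -1, 5],   # xi_2
    [-5, 7, 4, -4, -7, 5],    # xi_3
]

# a is the general mean of all 48 observations.
a = sum(totals) / (reps * len(totals))

# b' = S(k y) / (8 S(k^2)); S(k y) is taken over the totals, as in the text.
b_prime = [
    sum(k * t for k, t in zip(ks, totals)) / (reps * sum(k * k for k in ks))
    for ks in k_table
]

print(round(a, 6), [round(b, 6) for b in b_prime])
```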

It will have been observed that all the material needed for
the estimation process was available from the partition of the
sum of squares of $y$. Whichever is undertaken first, the
partition of the variance of $y$, or the estimation of $b_1'$, $b_2'$, &c.,
the material for the other calculation is automatically available
and little extra work is required to complete the analysis.

The regression equation of $y$ on $\xi_1$, $\xi_2$ and $\xi_3$ can be used
directly for the calculation of the numerical values of $Y$ when
$x=0$, 1, 2, 3, 4 and 5. Furthermore, this equation can be used
to observe the progression in goodness of fit of expectation with
observation, when the various terms are added one by one. When
$x=x_1=0$, $\xi_1=-5$, $\xi_2=5$, $\xi_3=-5$. Then the best-fitting linear
regression

$$Y_1=a+b_1'\xi_1=2.352042+0.279568\xi_1$$

gives

$$Y_{11}=2.352042+0.279568\times(-5)=0.954202$$

The best-fitting quadratic

$$Y_2=a+b_1'\xi_1+b_2'\xi_2=2.352042+0.279568\xi_1-0.050690\xi_2$$

gives

$$Y_{21}=0.954202-(5\times 0.050690)=0.700752$$

The best-fitting cubic

$$Y_3=a+b_1'\xi_1+b_2'\xi_2+b_3'\xi_3=2.352042+0.279568\xi_1-0.050690\xi_2-0.009057\xi_3$$

gives

$$Y_{31}=0.700752-(-5\times 0.009057)=0.746037$$
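The same step-by-step addition of terms can be carried through for all six sampling times. The sketch below uses the rounded coefficients quoted in the text and the $k$ values of Table 33; the function name `expectations` is an illustrative assumption.

```python
a = 2.352042
b_prime = [0.279568, -0.050690, -0.009057]  # b1', b2', b3' as quoted
k_table = [
    [-5, -3, -1, 1, 3, 5],
    [5, -1, -4, -4, -1, 5],
    [-5, 7, 4, -4, -7, 5],
]
totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]
observed = [t / 8 for t in totals]  # mean of the eight y values at each time


def expectations(order):
    """Expected Y at x = 0..5 using the xi terms up to the given order."""
    return [a + sum(b_prime[r] * k_table[r][i] for r in range(order))
            for i in range(6)]


linear, quadratic, cubic = (expectations(order) for order in (1, 2, 3))

for x in range(6):
    print(x, round(observed[x], 6), round(linear[x], 6),
          round(quadratic[x], 6), round(cubic[x], 6))
```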

The eight values of $y$ observed when $x=0$ totalled 5.937, so
their mean was 0.742125. The approach of expectation to this
observed value as term after term of the regression equation is
added is very clear. The mean values of $y$ observed at the six
sampling times are compared with their linear, quadratic and
cubic expectations in Table 35, and are shown graphically in
the accompanying figure.

[Figure: improvement in fit]

The polynomial

$$Y=a+b_1(x-\bar{x})+b_2(x-\bar{x})^2+b_3(x-\bar{x})^3$$

can be reconstructed from the equation relating $Y$, $\xi_1$, $\xi_2$ and $\xi_3$.
From the previous section $\xi_1=2(x-\bar{x})$ and, since $\bar{x}=2.5$,

$$\xi_2=\tfrac{3}{2}(x^2-5x+3\tfrac{1}{3})=\tfrac{3}{2}[(x-\bar{x})^2-2.91\dot{6}]=1.5(x-\bar{x})^2-4.3750$$

and

$$\xi_3=\tfrac{5}{3}(x^3-7.5x^2+13.7x-3)=\tfrac{5}{3}[(x-\bar{x})^3-5.05(x-\bar{x})]=1.\dot{6}(x-\bar{x})^3-8.41\dot{6}(x-\bar{x})$$

Then $b_1'\xi_1=0.5591(x-\bar{x})$, $b_2'\xi_2=0.2218-0.0760(x-\bar{x})^2$

and $b_3'\xi_3=0.0762(x-\bar{x})-0.0151(x-\bar{x})^3$

So

$$Y_1=a+b_1'\xi_1=2.3520+0.5591(x-\bar{x})$$

$$Y_2=a+b_1'\xi_1+b_2'\xi_2=2.5738+0.5591(x-\bar{x})-0.0760(x-\bar{x})^2$$

$$Y_3=a+b_1'\xi_1+b_2'\xi_2+b_3'\xi_3=2.5738+0.6353(x-\bar{x})-0.0760(x-\bar{x})^2-0.0151(x-\bar{x})^3$$

The effect of the correlations between $x$, $x^2$ and $x^3$ is obvious
from the way that the addition of the term involving $(x-\bar{x})^2$
changes the term independent of $x$, and the way that the addition
of $(x-\bar{x})^3$ increases the coefficient of $(x-\bar{x})$ from 0.5591 to 0.6353.
The utility of the $\xi$ functions depends on the fact that the addition
or subtraction of any term has no effect on the rest of the
regression equation.
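The reconstruction amounts to collecting the coefficients of each power of $(x-\bar{x})$. A minimal sketch, in which each $\xi$ is held as a list of coefficients of $(x-\bar{x})^0$ to $(x-\bar{x})^3$ (a representation chosen here purely for illustration):

```python
a = 2.352042
b_prime = [0.279568, -0.050690, -0.009057]

# Coefficients of (x - xbar)**0 .. (x - xbar)**3 for the integral-valued xi's.
xi = [
    [0.0, 2.0, 0.0, 0.0],                  # xi_1 = 2(x - xbar)
    [-4.375, 0.0, 1.5, 0.0],               # xi_2 = 1.5(x - xbar)^2 - 4.375
    [0.0, -5.05 * 5 / 3, 0.0, 5 / 3],      # xi_3 = (5/3)[(x - xbar)^3 - 5.05(x - xbar)]
]


def polynomial(order):
    """Coefficients of Y in powers of (x - xbar), using xi terms up to `order`."""
    coeffs = [a, 0.0, 0.0, 0.0]
    for r in range(order):
        for power in range(4):
            coeffs[power] += b_prime[r] * xi[r][power]
    return coeffs


y1, y2, y3 = (polynomial(order) for order in (1, 2, 3))
print([round(c, 4) for c in y1])
print([round(c, 4) for c in y2])
print([round(c, 4) for c in y3])
```

Comparing the three printed rows shows the point made in the text: adding the quadratic term alters the constant, and adding the cubic term alters the coefficient of $(x-\bar{x})$.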

38. REGRESSION ON TWO OR MORE VARIATES 

Polynomial regressions are special cases of the calculation of 
regressions on two or more variates which may not be independent 
of one another. The general solution to this problem has been 
found and is described under the name of Multiple Regression. 

Let us consider the regression of $y$ on two variates $x_1$ and $x_2$
which may be mutually dependent. We wish to find $a$, $b_1$ and
$b_2$ in the regression equation

$$Y=a+b_1(x_1-\bar{x}_1)+b_2(x_2-\bar{x}_2)$$

from $n$ triple observations of the kind $(y, x_1, x_2)$. The method
of least squares can be employed for this purpose.

$$(y-Y)=y-a-b_1(x_1-\bar{x}_1)-b_2(x_2-\bar{x}_2)$$

$$(y-Y)^2=[y-a-b_1(x_1-\bar{x}_1)-b_2(x_2-\bar{x}_2)]^2$$

$$=y^2-2ay+a^2+b_1^2(x_1-\bar{x}_1)^2+b_2^2(x_2-\bar{x}_2)^2+2b_1b_2(x_1-\bar{x}_1)(x_2-\bar{x}_2)-2[b_1(x_1-\bar{x}_1)+b_2(x_2-\bar{x}_2)][y-a]$$

and since $a$, $b_1$ and $b_2$ are constants

$$S(y-Y)^2=S(y^2)-2aS(y)+na^2+b_1^2S(x_1-\bar{x}_1)^2+b_2^2S(x_2-\bar{x}_2)^2+2b_1b_2S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]$$

$$-2b_1S[y(x_1-\bar{x}_1)]-2b_2S[y(x_2-\bar{x}_2)]+2a[b_1S(x_1-\bar{x}_1)+b_2S(x_2-\bar{x}_2)]$$

and the last term vanishes as $S(x_1-\bar{x}_1)=S(x_2-\bar{x}_2)=0$.

This sum of squares of deviations of $y$ from the expected $Y$
is next minimized by partially differentiating with respect to
$a$, $b_1$ and $b_2$, equating to 0 and solving.

$$\frac{\partial}{\partial a}S(y-Y)^2=2na-2S(y)=0$$

$$\frac{\partial}{\partial b_1}S(y-Y)^2=2b_1S(x_1-\bar{x}_1)^2+2b_2S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-2S[y(x_1-\bar{x}_1)]=0$$

$$\frac{\partial}{\partial b_2}S(y-Y)^2=2b_1S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]+2b_2S(x_2-\bar{x}_2)^2-2S[y(x_2-\bar{x}_2)]=0$$

Then, as before, $a=\bar{y}$ and $b_1$ and $b_2$ are the solutions of the equations

$$b_1S(x_1-\bar{x}_1)^2+b_2S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]=S[y(x_1-\bar{x}_1)]$$

$$b_1S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]+b_2S(x_2-\bar{x}_2)^2=S[y(x_2-\bar{x}_2)]$$

The covariance of $x_1$ and $x_2$ enters into the estimation of $b_1$ and $b_2$.
Thus the necessary adjustments for any mutual dependence of
$x_1$ and $x_2$ are automatically made. Where $x_1$ and $x_2$ are
independent, $S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]$ will be 0 within the limits of
sampling error. The equations then reduce to:

$$b_1S(x_1-\bar{x}_1)^2=S[y(x_1-\bar{x}_1)]$$

$$b_2S(x_2-\bar{x}_2)^2=S[y(x_2-\bar{x}_2)]$$

These are recognizable as the types used in calculating the
regressions on $\xi_1$ and $\xi_2$ in the last example, $\xi_1$ and $\xi_2$ having
been deliberately chosen to be independent of one another.

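The solutions to the general equations of estimation found
above are easily shown to be

$$b_1=\frac{S[y(x_2-\bar{x}_2)]\,S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S[y(x_1-\bar{x}_1)]\,S(x_2-\bar{x}_2)^2}{S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2}$$

$$b_2=\frac{S[y(x_1-\bar{x}_1)]\,S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S[y(x_2-\bar{x}_2)]\,S(x_1-\bar{x}_1)^2}{S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2}$$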

If these are then substituted in the expression found for
$S(y-Y)^2$, simple but tedious algebraic manipulation will show
that the sum of squares of $y$ removed by the two regressions
when taken together is

$$b_1S[y(x_1-\bar{x}_1)]+b_2S[y(x_2-\bar{x}_2)]$$

so that

$$S(y-Y)^2=S(y^2)-\frac{S^2(y)}{n}-b_1S[y(x_1-\bar{x}_1)]-b_2S[y(x_2-\bar{x}_2)]$$

Though this sum of squares due to regression is expressible as
the sum of two terms which seem to be ascribable to the relations
between $y$ and $x_1$ and $y$ and $x_2$ respectively, it is not, in fact,
capable of being so subdivided unless $x_1$ and $x_2$ are independent.
When this is not the case, the joint sum of squares incorporates
the necessary allowance for the non-independence of $x_1$ and $x_2$,
and it is clear that the sum of squares is not divisible. Two
parameters have been estimated and so the sum of squares due
to regression corresponds to 2 degrees of freedom. Hence

$$V_y=\frac{S(y-Y)^2}{n-3}$$

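In order to find the variance and standard error of $b_1$ we note
that $b_1=\frac{N}{D}$ where

$$N=S[y(x_2-\bar{x}_2)]\,S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S[y(x_1-\bar{x}_1)]\,S(x_2-\bar{x}_2)^2$$

and $D=S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2$

Now, as we have shown in Section 32, the variance of
$S[y(x_1-\bar{x}_1)]$ is $V_yS(x_1-\bar{x}_1)^2$ and that of $S[y(x_2-\bar{x}_2)]$ is
$V_yS(x_2-\bar{x}_2)^2$, and the covariance of $S[y(x_1-\bar{x}_1)]$ and
$S[y(x_2-\bar{x}_2)]$ is

$$V_yS[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]$$

Hence

$$V_N=V_y\{S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]\,S(x_2-\bar{x}_2)^2-2S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]\,S(x_2-\bar{x}_2)^2+S(x_1-\bar{x}_1)^2[S(x_2-\bar{x}_2)^2]^2\}$$

the second term depending on the covariance. This reduces to

$$V_N=-V_yS(x_2-\bar{x}_2)^2D$$

Then, since $D$ is independent of $y$,

$$V_{b_1}=\frac{V_N}{D^2}=\frac{V_yS(x_2-\bar{x}_2)^2}{S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2-S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]}$$

Similarly

$$V_{b_2}=\frac{V_yS(x_1-\bar{x}_1)^2}{S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2-S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]}$$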

A very convenient method of estimating $b_1$ and $b_2$, which at
the same time considerably eases the calculation of $V_{b_1}$ and
$V_{b_2}$, has been introduced by Fisher.

The equations of estimation of $b_1$ and $b_2$ are

$$b_1S(x_1-\bar{x}_1)^2+b_2S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]=S[y(x_1-\bar{x}_1)]$$

$$b_1S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]+b_2S(x_2-\bar{x}_2)^2=S[y(x_2-\bar{x}_2)]$$

Now if we set up two new pairs of equations like these but with
1 and 0 in the first pair, and 0 and 1 in the second pair,
substituted on the right-hand sides we can find

$$c_{11}=-\frac{1}{D}S(x_2-\bar{x}_2)^2 \qquad c_{12}=\frac{1}{D}S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]$$

$$c_{21}=\frac{1}{D}S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] \qquad c_{22}=-\frac{1}{D}S(x_1-\bar{x}_1)^2$$

where $D=S^2[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]-S(x_1-\bar{x}_1)^2\,S(x_2-\bar{x}_2)^2$, $c_{11}$ and
$c_{12}$ being the values of $b_1$ and $b_2$ in the first pair of equations
and $c_{21}$ and $c_{22}$ being the values of $b_1$ and $b_2$ in the second pair
of equations.



A comparison of $c_{11}$ and $c_{21}$ with the value of $b_1$ found by
solving the true equations of estimation shows that

$$b_1=c_{11}S[y(x_1-\bar{x}_1)]+c_{21}S[y(x_2-\bar{x}_2)]$$

and, similarly,

$$b_2=c_{12}S[y(x_1-\bar{x}_1)]+c_{22}S[y(x_2-\bar{x}_2)]$$

Furthermore, when $c_{11}$ is compared with $V_{b_1}$ it is seen that

$$V_{b_1}=c_{11}V_y$$

and similarly $V_{b_2}=c_{22}V_y$.

It may be added that this method of estimation holds distinct
advantages for the treatment of complex cases of regression on
several variates in that it makes easy the omission of a variate,
if this should be deemed necessary. Reference should be made
to Fisher's account of multiple regression for a further discussion
of this operation.

Example 16. Buck's data (Table 36) on the dependence of
the time of the first flash of male fireflies on the intensity of light
and on temperature provide a case which can be analysed by
the method of multiple regression. The time of first flash, $y$, is
given as minutes after 6.40 p.m., the temperature, $x_2$, in degrees
Centigrade and the light intensity, $x_1$, in metre-candles. To what
extent do the temperature and light intensity influence the
onset of flashing?

TABLE 36

The Time of First Flash of Male Fireflies in relation to
Temperature and Light Intensity (Buck)

Light intensity in    Temperature    Time of flash in
metre-candles         in °C.         minutes after 6.40 p.m.
($x_1$)               ($x_2$)        ($y$)

  26                  21.1           35
  35                  23.9           30
  ...                 17.8           48
  41                  22.0           40
  45                  22.3           21
  55                  22.3           42
  55                  23.3            0
  55                  20.6           44
  66                  25.6           28
  70                  21.7           30
  76                  26.7           18
  79                  26.0           23
  84                  26.7           19
  87                  24.4           26
 100                  22.3           26
 100                  22.3            6
 100                  26.6           36
 110                  26.7           30
 130                  26.6           21
 140                  28.7           30






From the table we find that

$S(x_1)=1,483.000$, $S(x_1^2)=129,409.000$, $S(x_1-\bar{x}_1)^2=19,444.550$

$S(x_2)=473.200$, $S(x_2^2)=11,308.720$, $S(x_2-\bar{x}_2)^2=112.808$

$S(y)=557.000$, $S(y^2)=18,237.000$, $S(y-\bar{y})^2=2,724.550$

$S(x_1x_2)=35,983.200$, $S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]=895.420$

$S(x_1y)=39,213.000$, $S[y(x_1-\bar{x}_1)]=-2,088.550$

$S(x_2y)=12,857.700$, $S[y(x_2-\bar{x}_2)]=-320.920$

Then we write down as the equations for the calculation of
$c_{11}$, $c_{12}$, $c_{21}$ and $c_{22}$:

$$19,444.550\,b_1+895.420\,b_2=1, \; 0$$
$$895.420\,b_1+112.808\,b_2=0, \; 1$$

giving as the solutions:

$$c_{11}=0.000081056 \qquad c_{12}=-0.000643389$$
$$c_{21}=-0.000643389 \qquad c_{22}=0.013971556$$

It should be noted that the relation $c_{12}=c_{21}$ provided a check on
the arithmetic of the solution.

Then $b_1=c_{11}S[y(x_1-\bar{x}_1)]+c_{21}S[y(x_2-\bar{x}_2)]$

$$=-(0.000081056\times 2,088.550)+(0.000643389\times 320.920)=0.037186$$

and $b_2=c_{12}S[y(x_1-\bar{x}_1)]+c_{22}S[y(x_2-\bar{x}_2)]$

$$=(0.000643389\times 2,088.550)-(0.013971556\times 320.920)=-3.140002$$

$$S(y-Y)^2=2,724.550+(0.037186\times 2,088.550)-(3.140002\times 320.920)=1,794.5254$$

There are 20 sets of observations in all, so that 17 degrees of
freedom remain after the estimation of $b_1$ and $b_2$.

Then

$$V_y=\frac{1,794.5254}{17}=105.56032$$

$$s_{b_1}=\sqrt{V_{b_1}}=\sqrt{0.000081056\times 105.56032}=0.092501$$

$$s_{b_2}=\sqrt{V_{b_2}}=\sqrt{0.013971556\times 105.56032}=1.214433$$

The significance of the departure of $b_1$ from the value of 0,
which would indicate independence of the time of flash and light
intensity, can be tested by finding $t=\frac{b_1}{s_{b_1}}$. This $t$ must have
17 degrees of freedom, as the sampling variance of $y$ is determined
on the basis of 17 comparisons.

$$t_{[17]}=\frac{0.037186}{0.092501}=0.402 \text{ with a probability of 0.7-0.6}$$

Similarly to test the significance of $b_2$,

$$t_{[17]}=\frac{3.140002}{1.214433}=2.586 \text{ with a probability of less than 0.02}$$


The time of flashing is thus demonstrably dependent on
temperature but not on light intensity over the range available
at this time of day.

Now if the simple regression of $y$ on $x_1$ is found without
reference to $x_2$ it can be shown that the test of significance of
dependence of $y$ on $x_1$ is given by $t_{[18]}=1.266$ with a probability
of just over 0.2. Even in this case significance is not obtained,
but the probability is much lower than that of 0.7-0.6 found
when $x_2$ is included in the analysis. The non-independence of
$x_1$ and $x_2$ results in a spuriously large apparent dependence
of $y$ on $x_1$ unless the relations holding between $x_1$ and $x_2$ are
taken into account, as by the use of the method of multiple
regression.

Before leaving this subject the general formulae for the
calculation of multiple regressions of $y$ on $j$ different variates may
be given. They are derived by an extension of the analysis given
above for the case of $x_1$ and $x_2$. There are $j$ equations:

$$b_1S(x_1-\bar{x}_1)^2+b_2S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]+\ldots+b_jS[(x_1-\bar{x}_1)(x_j-\bar{x}_j)]=S[y(x_1-\bar{x}_1)]$$

$$b_1S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)]+b_2S(x_2-\bar{x}_2)^2+\ldots+b_jS[(x_2-\bar{x}_2)(x_j-\bar{x}_j)]=S[y(x_2-\bar{x}_2)]$$

$$\ldots$$

$$b_1S[(x_1-\bar{x}_1)(x_j-\bar{x}_j)]+b_2S[(x_2-\bar{x}_2)(x_j-\bar{x}_j)]+\ldots+b_jS(x_j-\bar{x}_j)^2=S[y(x_j-\bar{x}_j)]$$

which may be replaced by $j$ sets each of $j$ equations with the
right sides altered to:

1, 0 ... 0
0, 1 ... 0
. . .
0, 0 ... 1

giving as solutions for $b_1$, $b_2$, &c.

$c_{11}$, $c_{12}$ ... $c_{1j}$
$c_{21}$, $c_{22}$ ... $c_{2j}$
. . .
$c_{j1}$, $c_{j2}$ ... $c_{jj}$

and

$$b_1=c_{11}S[y(x_1-\bar{x}_1)]+c_{21}S[y(x_2-\bar{x}_2)]+\ldots+c_{j1}S[y(x_j-\bar{x}_j)]$$

$$b_2=c_{12}S[y(x_1-\bar{x}_1)]+c_{22}S[y(x_2-\bar{x}_2)]+\ldots+c_{j2}S[y(x_j-\bar{x}_j)]$$

$$\ldots$$

$$b_j=c_{1j}S[y(x_1-\bar{x}_1)]+c_{2j}S[y(x_2-\bar{x}_2)]+\ldots+c_{jj}S[y(x_j-\bar{x}_j)]$$



Checks on the arithmetic are provided by the fact that $c_{12}=c_{21}$,
and so on.

$$S(y-Y)^2=S(y-\bar{y})^2-b_1S[y(x_1-\bar{x}_1)]-b_2S[y(x_2-\bar{x}_2)]-\ldots-b_jS[y(x_j-\bar{x}_j)]$$

and since $j$ parameters have been fitted

$$V_y=\frac{S(y-Y)^2}{n-j-1}$$

The variances of the regression coefficients are obtained from
the formulae

$$V_{b_1}=c_{11}V_y$$
$$V_{b_2}=c_{22}V_y$$
$$\ldots$$
$$V_{b_j}=c_{jj}V_y$$

It will be observed that since $x_1$ can bear any relation to
$x_2$, $x_3$, &c., without invalidating the calculation, we can take
$x_2=x_1^2$, $x_3=x_1^3$, &c. The calculation of polynomial regressions
may thus be undertaken by the use of the multiple regression
technique. The recalculation of Ashby's data by this method
may be made as an exercise.
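The exercise can be sketched as follows. Since the individual observations of Table 31 are not reproduced here, the sketch fits the six time means, which gives the same coefficients because every time has the same number of observations. It takes $x_1=(x-\bar{x})$, $x_2=x_1^2$ and $x_3=x_1^3$, so that the results are comparable with the polynomial in $(x-\bar{x})$ found in Section 37. The function names are illustrative assumptions.

```python
def c_multipliers(s):
    """Solve the j sets of equations with unit right-hand sides (Gauss-Jordan)."""
    j = len(s)
    m = [list(map(float, row)) + [1.0 if r == c else 0.0 for c in range(j)]
         for r, row in enumerate(s)]
    for col in range(j):
        pivot = max(range(col, j), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(j):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[j:] for row in m]


def multiple_regression(variates, y):
    """Return (a, [b_1..b_j], c) for the regression of y on the given variates."""
    n = len(y)
    dev = [[v - sum(col) / n for v in col] for col in variates]
    s = [[sum(p * q for p, q in zip(di, dj)) for dj in dev] for di in dev]
    sy = [sum(d * yy for d, yy in zip(di, y)) for di in dev]
    c = c_multipliers(s)
    b = [sum(c[i][k] * sy[i] for i in range(len(sy))) for k in range(len(sy))]
    return sum(y) / n, b, c


totals = [5.937, 12.119, 17.769, 23.023, 26.452, 27.598]
means = [t / 8 for t in totals]
u = [x - 2.5 for x in range(6)]                    # x - xbar
variates = [u, [v ** 2 for v in u], [v ** 3 for v in u]]

a, b, c = multiple_regression(variates, means)
print(round(a, 6), [round(v, 4) for v in b])
```

The three coefficients printed agree with those of $(x-\bar{x})$, $(x-\bar{x})^2$ and $(x-\bar{x})^3$ in the cubic reconstructed from the $\xi$ functions.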


39. DISCRIMINANT FUNCTIONS 

A new device introduced by Fisher closely resembles the
technique of multiple regression in some ways, though it is
designed for the solution of a rather different class of problem.
This is the calculation of so-called discriminant functions. A
number of uses have already been found for these, ranging from
anthropometric classification of skulls to parent selection in plant
breeding and to the choice of speciality salesmen, but the most
common use will perhaps be found in the handling of taxonomic
data. We shall discuss discriminants in this connexion; their
use in the solution of problems in other fields of research involves
the same statistical analysis.

Suppose we have a number of individuals of two species or
varieties from which certain measurements have been taken.
It may be difficult to distinguish the species or variety to which
an individual belongs. We might then wish to find that linear
compound of the available measurements which will give the
smallest possible frequency of misclassification when used as
a means of discrimination. Such a linear compound is termed
a discriminant function of the measurements.

Let us call the two species or varieties A and B respectively
and, for the sake of simplicity, confine our attention to the case
where two measurements, $x_1$ and $x_2$, are available from each of
equal numbers of individuals of two groups. If we call the linear
compound $X$, the problem is to find those values of $\lambda_1$ and $\lambda_2$ in
the expression

$$X=\lambda_1x_1+\lambda_2x_2$$

which will minimize the misclassification of A and B when $X$ is
used as the means of discrimination. We may denote the mean value
of $X$ in species A as $\bar{X}_A$ and that in species B as $\bar{X}_B$. The
difference between $\bar{X}_A$ and $\bar{X}_B$ will be called $D$.

Then $\bar{X}_A=\lambda_1\bar{x}_{1A}+\lambda_2\bar{x}_{2A}$

and $\bar{X}_B=\lambda_1\bar{x}_{1B}+\lambda_2\bar{x}_{2B}$

and $D=\lambda_1d_1+\lambda_2d_2$

where $d_1=\bar{x}_{1A}-\bar{x}_{1B}$ and $d_2=\bar{x}_{2A}-\bar{x}_{2B}$

Misclassification will increase when the sampling error of 
X^ or Xg becomes larger as compared with the ddfference between 
A and B. The minimization of misclassification, therefore, in¬ 
volves making D as large as possible with respect to the sampling 
variance of and Xg, by adjustment of Xx and X*. The variance 
of X must be partitioned in such a way that the mean square 
attributable to D is at a maximum, the remainder at a minimum. 
It is not, however, actually necessary to maximize the ratio of 
the mean square of X between species to that within species. 
Maximization of any ratio proportional to this one will serve 
equally well. 

The mean square between species will be proportional to $D^2$,
and, as $X = \lambda_1 x_1 + \lambda_2 x_2$, so that

$$V_X = \lambda_1^2 V_{x_1} + \lambda_2^2 V_{x_2} + 2\lambda_1\lambda_2 W_{x_1 x_2}$$

(where $W_{x_1 x_2}$ is the covariance of $x_1$ and $x_2$), the mean square
within species will be proportional to

$$T = \lambda_1^2 S(x_1-\bar{x}_1)^2 + 2\lambda_1\lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] + \lambda_2^2 S(x_2-\bar{x}_2)^2$$

the deviations of $x_1$ and $x_2$ being taken from the specific means,
not from the general mean. So the maximization of $D^2/T$ will give
the values of $\lambda_1$ and $\lambda_2$ in the compound X. This maximization
can be accomplished by partially differentiating $D^2/T$ with respect to
$\lambda_1$ and $\lambda_2$, equating to 0 and solving the two simultaneous equations
that are so obtained.

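$$\frac{\partial}{\partial\lambda_1}\left(\frac{D^2}{T}\right) = \frac{1}{T^2}\left(2DT\frac{\partial D}{\partial\lambda_1} - D^2\frac{\partial T}{\partial\lambda_1}\right) = 0$$

i.e.

$$2DT\frac{\partial D}{\partial\lambda_1} - D^2\frac{\partial T}{\partial\lambda_1} = 0$$

Hence

$$\frac{1}{2}\frac{\partial T}{\partial\lambda_1} = \frac{T}{D}\frac{\partial D}{\partial\lambda_1} \quad\text{and}\quad \frac{1}{2}\frac{\partial T}{\partial\lambda_2} = \frac{T}{D}\frac{\partial D}{\partial\lambda_2}$$

Since $T/D$ is common to both equations it may be removed to give
two new equations whose solutions will themselves be proportional
to the solutions of the two equations obtained by the maximization
process.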


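Now $D = \lambda_1 d_1 + \lambda_2 d_2$, and T is the quadratic expression in $\lambda_1$ and $\lambda_2$ given above. So

$$\frac{\partial T}{\partial\lambda_1} = 2\lambda_1 S(x_1-\bar{x}_1)^2 + 2\lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)], \qquad \frac{\partial D}{\partial\lambda_1} = d_1$$

$$\frac{\partial T}{\partial\lambda_2} = 2\lambda_1 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] + 2\lambda_2 S(x_2-\bar{x}_2)^2, \qquad \frac{\partial D}{\partial\lambda_2} = d_2$$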

Thus the equations of estimation become 

$$\lambda_1 S(x_1-\bar{x}_1)^2 + \lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] = d_1$$
$$\lambda_1 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] + \lambda_2 S(x_2-\bar{x}_2)^2 = d_2$$

which are of the familiar multiple regression type and may be
solved in the same way as multiple regression equations. Putting
1 and 0, and 0 and 1 respectively for $d_1$ and $d_2$ gives two pairs
of equations whose solutions are

$c_{11}$, $c_{12}$ and $c_{12}$, $c_{22}$

from which $\lambda_1 = c_{11}d_1 + c_{12}d_2$ and $\lambda_2 = c_{12}d_1 + c_{22}d_2$

As in the case of multiple regression

$V_{\lambda_1} = c_{11}V_X$ and $V_{\lambda_2} = c_{22}V_X$

To find the value of $V_X$ an analysis of variance of X is
necessary. The partition of the sum of squares of X offers no
difficulty. The sum of squares between species can be found
from $S(X_A) - S(X_B)$ which is, of course, equal to $nD$, where n is
the number of individuals of each species to be measured. This
quantity is a compound of 2n observations each used once, and
so the divisor used in finding the sum of squares will be 2n.
The sum of squares between species then becomes:

$$\frac{(nD)^2}{2n} = \frac{n}{2}D^2$$

The sum of squares within species has already been used
under the symbol T. Hence the partition of the sum of squares
takes the form

Between species     $\frac{n}{2}D^2$
Within species      $T = D$

Total               $D\left(1 + \frac{n}{2}D\right)$


It will be observed that T is put equal to D. This is true, 
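because if we multiply the two equations of estimation of $\lambda_1$ and
$\lambda_2$ by $\lambda_1$ and $\lambda_2$ respectively, we find

$$\lambda_1^2 S(x_1-\bar{x}_1)^2 + \lambda_1\lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] = \lambda_1 d_1$$

$$\lambda_1\lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] + \lambda_2^2 S(x_2-\bar{x}_2)^2 = \lambda_2 d_2$$

which, on summing, gives

$$\lambda_1^2 S(x_1-\bar{x}_1)^2 + 2\lambda_1\lambda_2 S[(x_1-\bar{x}_1)(x_2-\bar{x}_2)] + \lambda_2^2 S(x_2-\bar{x}_2)^2 = \lambda_1 d_1 + \lambda_2 d_2$$

i.e. $T = D$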

The next task is that of assigning the appropriate numbers 
of degrees of freedom to these two sums of squares. The total 
number will clearly be $2n-1$ as there are 2n values of X in all.
The partition of these $2n-1$ between the two items is not obvious
until the analogy with multiple regression has been pursued a 
little further. 

Suppose that we assigned some arbitrary value to each 
individual of the two species such that all the individuals of one 
species received the same value, and that the sum of all these 
values was 0. Where, as in the case we have been considering, 
equal numbers of the two species have been measured, the values 
of 1 for each individual of species A and -1 for each of species 
B would be suitable. If the arbitrary values are then treated 
as dependent variates their multiple regression on $x_1$ and $x_2$ can
be found. This would give for each individual of each species
an expectation

$$Y = a + b_1(x_1-\bar{x}_1) + b_2(x_2-\bar{x}_2)$$

where $a = \bar{y} = 0$ and $\bar{x}_1$ and $\bar{x}_2$ are the general, not, as above, the
specific means.

There are only two species involved and so the mean difference 
between the species will be fully accounted for by the regression 
and the only differences between y and Y will exist within species. 
Now $S(y-Y)^2$ is, by definition of multiple regression, at a
minimum. Hence the fitting of $b_1$ and $b_2$ is exactly equivalent
in its effect on the mean squares to the calculation of $\lambda_1$ and $\lambda_2$.
Indeed, it can be shown that $b_1$ and $b_2$ are proportional to $\lambda_1$
and $\lambda_2$.
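
The proportionality just stated can be checked numerically. The following is a minimal sketch in Python, not part of the original text; the eight pairs of measurements are hypothetical and the helper names `solve_2x2`, `sums_about` and `mean` are introduced here purely for illustration. It solves the discriminant equations (deviations from the specific means) and the regression of the arbitrary values +1 and -1 (deviations from the general means) and compares the ratios of the two pairs of coefficients.

```python
def solve_2x2(a11, a12, a22, r1, r2):
    """Solve a11*u + a12*v = r1 and a12*u + a22*v = r2 for (u, v)."""
    det = a11 * a22 - a12 * a12
    return (r1 * a22 - r2 * a12) / det, (r2 * a11 - r1 * a12) / det


def mean(values):
    return sum(values) / len(values)


def sums_about(rows, m1, m2):
    """Sums of squares and products of (x1, x2) about the means m1, m2."""
    s11 = sum((x1 - m1) ** 2 for x1, _ in rows)
    s12 = sum((x1 - m1) * (x2 - m2) for x1, x2 in rows)
    s22 = sum((x2 - m2) ** 2 for _, x2 in rows)
    return s11, s12, s22


# Hypothetical measurements (x1, x2), four individuals of each species.
species_a = [(5.0, 3.0), (6.0, 4.0), (7.0, 3.5), (6.5, 5.0)]
species_b = [(3.0, 2.0), (4.0, 4.0), (5.0, 3.0), (4.5, 2.5)]

# Discriminant: sums within species, right-hand sides d1 and d2.
ma1, ma2 = mean([r[0] for r in species_a]), mean([r[1] for r in species_a])
mb1, mb2 = mean([r[0] for r in species_b]), mean([r[1] for r in species_b])
wa = sums_about(species_a, ma1, ma2)
wb = sums_about(species_b, mb1, mb2)
w11, w12, w22 = (wa[i] + wb[i] for i in range(3))
lam1, lam2 = solve_2x2(w11, w12, w22, ma1 - mb1, ma2 - mb2)

# Regression of y (+1 for A, -1 for B) on x1 and x2, about the general means.
everyone = species_a + species_b
g1, g2 = mean([r[0] for r in everyone]), mean([r[1] for r in everyone])
t11, t12, t22 = sums_about(everyone, g1, g2)
y = [1.0] * len(species_a) + [-1.0] * len(species_b)
p1 = sum(yi * (x1 - g1) for yi, (x1, _) in zip(y, everyone))
p2 = sum(yi * (x2 - g2) for yi, (_, x2) in zip(y, everyone))
b1, b2 = solve_2x2(t11, t12, t22, p1, p2)

# The two ratios agree, so (b1, b2) is proportional to (lam1, lam2).
print(lam1 / lam2, b1 / b2)
```

The two printed ratios coincide apart from rounding in the arithmetic, whatever equal-sized samples are supplied.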

We have already seen in the last section that a multiple 
regression on two independent variates $x_1$ and $x_2$ takes up
2 degrees of freedom in the analysis of variance of y. The
remainder of the sum of squares has $2n-3$ degrees of freedom when
n individuals of each species are involved. The discriminant 
leads to the same partition of the sum of squares and so the 
partitions of the degrees of freedom must be alike in the two
cases. The analysis of variance of the discriminant function X
will thus be:

                     Sum of Squares                      N
Between species      $\frac{n}{2}D^2$                    2
Within species       $D$                                 $2n-3$

Total                $D\left(1 + \frac{n}{2}D\right)$    $2n-1$

The significance of the specific difference can be tested by
means of a z or variance ratio, using the mean square within
species as the error variance.

Example 17. The two races, A and B, of the fly Drosophila
pseudo-obscura are not easily distinguishable morphologically, but
when the numbers of teeth in the sex-combs of the males are
counted it is found that race A shows a greater mean than race B.
There are two sex-combs on each foreleg of the male, the proximal
comb and distal comb respectively, and the numbers of teeth in
the two are not fully correlated. The mean numbers of teeth in
each comb in 11 different strains of each race are shown in
Table 37. What compound of $x_p$ and $x_d$ will maximize the racial
difference and to what extent will this improve the racial
classification of males?


TABLE 37

Mean Numbers of Teeth in the Male Sex-combs of Eleven Strains in
each Race of Drosophila pseudo-obscura (Mather and Dobzhansky)

            Race A                              Race B
Proximal    Distal                  Proximal    Distal
comb        comb                    comb        comb
($x_p$)     ($x_d$)     X           ($x_p$)     ($x_d$)     X

6.36        5.24        2.546       6.00        4.88        2.394
5.92        5.12        2.402       5.60        4.64        2.245
5.92        5.36        2.434       5.64        4.96        2.299
6.44        5.64        2.623       5.76        4.80        2.313
6.40        5.16        2.547       5.96        5.08        2.409
6.56        5.56        2.647       5.72        5.04        2.333
6.64        5.36        2.644       5.64        4.96        2.299
6.68        4.96        2.602       5.44        4.88        2.231
6.72        5.48        2.682       5.04        4.44        2.056
6.76        5.60        2.710       4.56        4.04        1.863
6.72        5.08        2.629       5.48        4.20        2.152

$X = 0.291174x_p + 0.132507x_d$


For the purpose of finding the discriminant it is necessary to calculate

$d_p = \bar{x}_{pA} - \bar{x}_{pB} = 6.465454 - 5.530909 = 0.934545$
$d_d = \bar{x}_{dA} - \bar{x}_{dB} = 5.323636 - 4.720000 = 0.603636$

$S(x_p-\bar{x}_p)^2$ within races $= 7.431927 - 4.803563 = 2.628364$
$S(x_d-\bar{x}_d)^2$ within races $= 3.752727 - 2.004072 = 1.748655$
$S[(x_p-\bar{x}_p)(x_d-\bar{x}_d)]$ within races $= 4.380073 - 3.102691 = 1.277382$

Then $2.628364\lambda_p + 1.277382\lambda_d = 0.934545$

$1.277382\lambda_p + 1.748655\lambda_d = 0.603636$

which gives $\lambda_p = 0.291174$ and $\lambda_d = 0.132507$

So $X = 0.291174x_p + 0.132507x_d$

and

$D = \bar{X}_A - \bar{X}_B = \lambda_p d_p + \lambda_d d_d = (0.291174 \times 0.934545 + 0.132507 \times 0.603636)$

$= 0.352101$
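
The arithmetic of this example can be retraced mechanically. The sketch below is an illustration in Python rather than anything given in the original; the strain means are typed in as read from Table 37 (the transcription itself should be treated as an assumption), and the function name `means_and_within_sums` is introduced only for this purpose.

```python
# Strain means (x_p, x_d) as read from Table 37; the transcription is an assumption.
race_a = [(6.36, 5.24), (5.92, 5.12), (5.92, 5.36), (6.44, 5.64),
          (6.40, 5.16), (6.56, 5.56), (6.64, 5.36), (6.68, 4.96),
          (6.72, 5.48), (6.76, 5.60), (6.72, 5.08)]
race_b = [(6.00, 4.88), (5.60, 4.64), (5.64, 4.96), (5.76, 4.80),
          (5.96, 5.08), (5.72, 5.04), (5.64, 4.96), (5.44, 4.88),
          (5.04, 4.44), (4.56, 4.04), (5.48, 4.20)]


def means_and_within_sums(rows):
    """Return the two means and the sums of squares and products about them."""
    n = len(rows)
    mean_p = sum(p for p, _ in rows) / n
    mean_d = sum(d for _, d in rows) / n
    s_pp = sum((p - mean_p) ** 2 for p, _ in rows)
    s_pd = sum((p - mean_p) * (d - mean_d) for p, d in rows)
    s_dd = sum((d - mean_d) ** 2 for _, d in rows)
    return mean_p, mean_d, s_pp, s_pd, s_dd


pa, da, a_pp, a_pd, a_dd = means_and_within_sums(race_a)
pb, db, b_pp, b_pd, b_dd = means_and_within_sums(race_b)

# Racial differences and the pooled sums within races.
d_p, d_d = pa - pb, da - db
s_pp, s_pd, s_dd = a_pp + b_pp, a_pd + b_pd, a_dd + b_dd

# Solve the two equations of estimation for lambda_p and lambda_d.
det = s_pp * s_dd - s_pd ** 2
lam_p = (d_p * s_dd - d_d * s_pd) / det
lam_d = (d_d * s_pp - d_p * s_pd) / det
D = lam_p * d_p + lam_d * d_d

print(round(s_pp, 6), round(s_pd, 6), round(s_dd, 6))
print(round(lam_p, 4), round(lam_d, 4), round(D, 4))
```

Run on these figures, the sums within races reproduce those quoted in the text, and the coefficients and D agree with the printed values to about four decimal places.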

The number of strains of each race used in the calculation,
i.e. n, is 11. The analysis of variance of X may now be written
down.

Item             Sum of              Mean        Variance
                 Squares      N      Square      Ratio       Probability

Between races    0.681863     2      0.340932    18.397      Very small
Within races     0.352101     19     0.018532

Total            1.033964     21

The sum of squares of X between races is

$$\frac{n}{2}D^2 = \frac{11}{2}(0.352101)^2 = 0.681863$$

and the sum of squares within races is $D = 0.352101$.

The calculation of the mean squares and variance ratios 
proceeds as usual, and the probability is found to be very small 
when the table of variance ratios is consulted. There is a good 
racial difference in the value of our discriminant X. 
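
As an illustrative check, not taken from the original, the analysis of variance for a two-measurement discriminant can be written down from D and n alone. The function name `discriminant_anova` below is hypothetical.

```python
def discriminant_anova(D, n):
    """Analysis of variance of a discriminant built from two measurements.

    D is the difference between the two group means of X and n the number
    of individuals (here strains) in each group.
    """
    between_ss = n * D ** 2 / 2      # sum of squares between groups
    within_ss = D                    # T = D, as shown in the text
    between_df = 2                   # one degree of freedom per measurement
    within_df = 2 * n - 3
    between_ms = between_ss / between_df
    within_ms = within_ss / within_df
    return {
        "between_ss": between_ss,
        "within_ss": within_ss,
        "total_ss": between_ss + within_ss,
        "between_ms": between_ms,
        "within_ms": within_ms,
        "variance_ratio": between_ms / within_ms,
    }


table = discriminant_anova(D=0.352101, n=11)
for item, value in table.items():
    print(item, round(value, 6))
```

With D = 0.352101 and n = 11 this gives the sums of squares, mean squares and variance ratio of the table above.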

Misclassification of a strain will occur when its departure 
from the racial mean is greater than half the racial difference, 
provided that the departure occurs in the right direction; for 
in such a case the strain mean will fall nearer to the mean of 
the race to which it does not belong. 

The racial difference in X is 0.352101 and the estimated
standard deviation of X within races is $\sqrt{0.018532} = 0.13613$.
A deviation of $\frac{D}{2} = \frac{0.352101}{2} = 0.17605$ will cause misclassification,
and such a deviation is $\frac{0.17605}{0.13613} = 1.293$ times the standard
deviation, as estimated from the 19 degrees of freedom. Now
a $t_{[19]}$ equals or exceeds the value 1.293 by chance in just over
20% of cases. But for misclassification to occur the deviation
must be in one given direction. Hence, although such a t value
may be exceeded in about 20% of cases, misclassification will
occur in about 10% of cases only. The other 10% of cases
represents the occasions on which the departure of the strain 
mean from the race mean is over half the size of the racial 
difference but is in the direction away from that which causes 
misclassification. 
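
Where no table of t is to hand, the probability quoted above can be approximated numerically. The following sketch is an assumption-laden illustration, not the author's procedure: it integrates the density of t by Simpson's rule using only the Python standard library, and the function names `t_density` and `two_sided_tail` are introduced here.

```python
import math


def t_density(x, df):
    """Density of Student's t with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + x * x / df) ** (-(df + 1) / 2)


def two_sided_tail(t, df, steps=2000):
    """P(|T| >= t), from Simpson's rule applied to the density over 0..t."""
    h = t / steps                       # steps must be even
    total = t_density(0.0, df) + t_density(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area_0_to_t = total * h / 3
    return 1 - 2 * area_0_to_t


D = 0.352101                            # racial difference in X
s = math.sqrt(0.018532)                 # standard deviation within races
t = (D / 2) / s
p_two_sided = two_sided_tail(t, df=19)
misclassification = p_two_sided / 2     # deviation must be in one direction

print(round(t, 3), round(p_two_sided, 3), round(misclassification, 3))
```

The two-sided probability comes out a little above 20%, so that the expected misclassification is a little above 10%, in line with the figures in the text.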

The misclassification rate of the strains of Table 37
agrees reasonably well with this expectation. $\bar{X}_A = 2.58799$ and
$\bar{X}_B = 2.23589$. Therefore anything greater than 2.41194 will be
attributed to race A and anything less to race B. This results
in 1 out of the 22 strains being misclassified where 2.2 are expected
to be placed wrongly. We should note also that two other
strains deviate from the mean by more than 0.17605 in the
direction of not causing misclassification, 2.2 being expected to
show such a deviation.

The same analysis may be made on the data for proximal 
and distal combs separately. With the proximal comb the 
racial difference is 0.934545 and the mean square within races
$\frac{2.628364}{20} = 0.131418$. Note that here there are 20 degrees of
freedom within races in contrast to the 19 of the discriminant
analysis. Then $t_{[20]} = \frac{0.934545/2}{\sqrt{0.131418}} = 1.289$, which has
a probability almost equal to that of the discriminant's t, with
misclassification about 10%. The discriminant enjoys but little
advantage over the proximal comb used alone.

The distal comb alone is not so discriminative as either the 
proximal comb or the discriminant X. The racial difference is
0.603636 and the standard deviation within races
$\sqrt{\frac{1.748655}{20}} = 0.29569$.
So $t_{[20]} = \frac{0.603636/2}{0.29569} = 1.021$ with a probability of 30% and hence
a misclassification frequency of 15%.

The frequency of misclassification can be found exactly if
t values are used, as the t distribution makes due allowance for
sampling errors in the estimation of the standard deviation.
The use of the normal deviate for this purpose would not accommodate
such sampling errors and hence would lead to underestimation
of the misclassification frequency.



If the discriminant is to be used frequently in practice it is
better to put

$$X' = 3.43437X = x_p + 0.4551x_d$$

The first-order quantities, such as mean and standard deviation,
will then all be 3.43437 times the values obtained when
X was used and the quadratic quantities, like the variance and
sums of squares, will be $(3.43437)^2$, i.e. 11.7949 times the value
given by X.

A further modification might be considered for very extensive 
use of the discriminant. We could take a new X equal to
$x_p + \frac{1}{2}x_d$. This would not cause much loss of precision and would
materially lessen the arithmetic. In order to use this compound
a fresh standard deviation would be necessary and it could be
found from the sum of squares within races obtained as

$$S(x_p-\bar{x}_p)^2 + 2\left(\tfrac{1}{2}\right)S[(x_p-\bar{x}_p)(x_d-\bar{x}_d)] + \left(\tfrac{1}{2}\right)^2 S(x_d-\bar{x}_d)^2$$

all deviations being from the racial means. 

In the present example the extensive use of the discriminant 
would hardly be justified as the proximal comb alone supplied 
nearly as much information about racial differences. In other 
cases, however, this is not true and the discriminant has been 
found to give much greater precision of classification than any 
single measurement. Such a case is described in the genus Iris,
where three species were involved, by Fisher, and should be 
consulted for further details of the application of discriminant 
functions to taxonomic problems. 


REFERENCES 

ASHBY, E. 1937. Studies in the inheritance of physiological characters.
III. Ann. Bot. N.S., 1, 11-41.

BUCK, J. B. 1937. Studies on the firefly. I. Physiol. Zool., 10, 45-58.

FISHER, R. A. 1936. The use of multiple measurements in taxonomic problems.
Ann. Eugen., 7, 87-104.

FISHER, R. A. 1944. Statistical Methods for Research Workers. Oliver
and Boyd. Edinburgh. 9th ed.

FISHER, R. A., and YATES, F. 1943. Statistical Tables for Biological, Agricultural
and Medical Research. Oliver and Boyd. Edinburgh. 2nd ed.

MATHER, K., and DOBZHANSKY, TH. 1939. Morphological differences
between the 'races' of Drosophila pseudo-obscura. Amer. Nat.,
73, 5-25.


CHAPTER X 


CORRELATION 


40. INTER-CLASS CORRELATION 

IN the calculation of a regression coefficient the parts played 
by the dependent variate, y, and the independent variate, x, are
very different. Hence the regression of y on x is quite distinct
from the regression of x on y. This difference is reflected in the 
conditions which must be fulfilled by the variates in order that 
the regression calculation shall be valid. 

It is, however, not infrequently the case that x and y are 
similarly distributed, whereupon the regression of x on y has as
much relevance to the problem as has the regression of y on x. 
One example of this is afforded by the coincident measurement 
of some physical character, e.g. stature, in fathers and sons, both 
the fathers and sons being sampled at random from the population. 
With such data it is possible to use the correlation coefficient for 
the purpose of summarization and analysis. 

The correlation coefficient has occupied a very important 
place in statistics, but its use is gradually dying, as the method 
of regression will always offer as good a solution to a problem 
and is very frequently much better. This is especially true where 
x, the independent variate, is not normally distributed, as in the 
case of the analyses shown in Examples 12, 13 and 16. The 
use of the correlation coefficient for such analyses would be 
incorrect. Correlation is still used fairly extensively, chiefly in 
the analysis of observations made on non-experimental material, 
and so some account of its application must be given. In general, 
however, the method of regression is much to be preferred. 

When a variable, x, is normally distributed about the mean
$\mu$ with a standard deviation $\sigma$, the frequency curve may be
represented by the formula

$$m_x = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}} \qquad (1)$$

$m_x$ being the frequency expected in the class showing measurement
x. Similarly, when the two variables, x and y, are each
normally distributed with standard deviations $\sigma_x$ and $\sigma_y$ about
the means $\mu_x$ and $\mu_y$, the frequency surface is representable by

$$m_{xy} = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\, \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{(x-\mu_x)^2}{\sigma_x^2} - \frac{2\rho(x-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right\} \qquad (2)$$

$m_{xy}$ being the expected frequency of the double observation (x, y).

This formula contains a new quantity $\rho$ which is called the
inter-class correlation coefficient of x and y.

The following properties of $\rho$ are not difficult to deduce
mathematically from the formula (2). Suppose that $(x-\mu_x)$ takes
the value $x_1$. Then from formula (1)

$$m_1 = \frac{1}{\sigma_x\sqrt{2\pi}}\, e^{-\frac{x_1^2}{2\sigma_x^2}}$$

and from (2)

$$m_{1y} = \frac{1}{2\pi\sigma_x\sigma_y\sqrt{1-\rho^2}}\, \exp\left\{-\frac{1}{2(1-\rho^2)}\left[\frac{x_1^2}{\sigma_x^2} - \frac{2\rho x_1(y-\mu_y)}{\sigma_x\sigma_y} + \frac{(y-\mu_y)^2}{\sigma_y^2}\right]\right\}$$

So, when $x-\mu_x = x_1$, the frequency distribution of y is representable
as

$$\frac{m_{1y}}{m_1} = \frac{1}{\sigma_y\sqrt{2\pi(1-\rho^2)}}\, \exp\left\{-\frac{\left[(y-\mu_y) - \rho\frac{\sigma_y}{\sigma_x}x_1\right]^2}{2(1-\rho^2)\sigma_y^2}\right\}$$

Thus no matter what particular value x may take, y is always
normally distributed, and when $(x-\mu_x)$ has the value $x_1$ the mean
value of y, measured from $\mu_y$, is $\rho\frac{\sigma_y}{\sigma_x}x_1$. This relation may be re-written as

$$Y - \mu_y = \rho\frac{\sigma_y}{\sigma_x}(x-\mu_x)$$

which is clearly the formula to a straight regression line with
$b_{yx} = \rho\frac{\sigma_y}{\sigma_x}$.

By substituting for y it can equally well be shown that x is
normally distributed when y takes any special value and that the
regression of x on y is linear with $b_{xy} = \rho\frac{\sigma_x}{\sigma_y}$. On comparing the
two regression coefficients, that of y on x and that of x on y, it
becomes obvious that $\rho$ is the geometrical mean of the two. It
can then be shown that $\rho$ must lie within the range +1 to -1.

One further conclusion remains to be drawn. When x takes
a particular value the variance of y becomes $\sigma_y^2(1-\rho^2)$, i.e. is
independent of x. So we may say that the correlation of y with
x accounts for a proportion $\rho^2$ of the variance of y, the remaining
portion $(1-\rho^2)$ being independent of x.

Example 18. A typical example of data to which the correlation
approach is often applied is afforded by the results of
Roberts and Griffiths's comparison of two methods of assessing
the intelligence of all children born in Bath during the period
13 to 31 January 1922.

Examination was by means of both the Binet and the Otis
tests, the former giving an I.Q. and the latter an I.B. value for
each child. The results of testing these 65 children by both
methods are given in Table 38.




TABLE 38

The I.Q. and I.B. scores of 65 Bath children (Roberts and Griffiths)

I.Q. (x)   I.B. (y)     I.Q. (x)   I.B. (y)     I.Q. (x)   I.B. (y)

   67         36           91         91          108         91
   70         28           91        ...          ...        111
   72         34           92         92          108        115
   74         28           92         98          109        134
   75         48           94        116          110        113
   76        ...           95         80          110        124
   77         62           96         96          110        ...
   78         22           96        108          110        140
   81         82           96        146          112        145
   82         84           97        118          113        147
   83         64           97        121          114        126
   83         77           99        106          114        132
   83         82          ...         79          116        142
   84         92          101        103          116        157
   85         91          101        113          116        126
   86         65          101        118          116        138
   86         75          101        119          123        149
   86         76          101        141          126        142
   87         68          103        116          126        164
   89        ...          103        131          127        172
   89        ...          103        139          136        166
   91         72          107        102

                                            Total 6,366      6,739


If we wished to find the regression of I.B., which may be
called y, on I.Q., which may be called x, we should calculate

$$b_{yx} = \frac{S[y(x-\bar{x})]}{S(x-\bar{x})^2} = \frac{S[(x-\bar{x})(y-\bar{y})]}{S(x-\bar{x})^2}$$

Similarly, the regression of x on y would be found as

$$b_{xy} = \frac{S[x(y-\bar{y})]}{S(y-\bar{y})^2} = \frac{S[(x-\bar{x})(y-\bar{y})]}{S(y-\bar{y})^2}$$

Our estimate, r, of the correlation coefficient $\rho$ is the geometric
mean of $b_{yx}$ and $b_{xy}$, so that

$$r = \sqrt{\frac{S[(x-\bar{x})(y-\bar{y})]}{S(x-\bar{x})^2} \cdot \frac{S[(x-\bar{x})(y-\bar{y})]}{S(y-\bar{y})^2}} = \frac{S[(x-\bar{x})(y-\bar{y})]}{\sqrt{S(x-\bar{x})^2\, S(y-\bar{y})^2}}$$

On dividing both numerator and denominator by the number of
degrees of freedom it is apparent that the equation of estimation
may also be written

$$r = \frac{W_{xy}}{s_x s_y}$$

where $W_{xy}$ is the covariance of x and y and $s_x$ and $s_y$ are the
estimates of the two standard deviations.






Thus the calculation of r necessitates our knowing the sums
of squares and the sum of cross-products of x and y. From
Table 38 we find

$S(x) = 6,366$ and $S(y) = 6,739$

Then

$S(x-\bar{x})^2 = S(x^2) - \frac{S^2(x)}{n} = 638,650.0000 - 623,476.2462 = 15,173.7538$

$S(y-\bar{y})^2 = S(y^2) - \frac{S^2(y)}{n} = 781,685.0000 - 698,678.7846 = 83,006.2154$

$S[(x-\bar{x})(y-\bar{y})] = S(xy) - \frac{S(x)S(y)}{n} = 691,449.0000 - 660,007.2923 = 31,441.7077$

and

$$r = \frac{S[(x-\bar{x})(y-\bar{y})]}{\sqrt{S(x-\bar{x})^2\, S(y-\bar{y})^2}} = \frac{31,441.7077}{\sqrt{15,173.7538 \times 83,006.2154}} = \frac{31,441.7077}{35,489.6587} = 0.885940$$

We may note that the regression of x on y is

$$b_{xy} = \frac{31,441.7077}{83,006.2154} = 0.378787$$

And the regression of y on x

$$b_{yx} = \frac{31,441.7077}{15,173.7538} = 2.072111$$

As a check on the arithmetic

$$r = \sqrt{b_{xy}b_{yx}} = \sqrt{0.378787 \times 2.072111} = 0.885940$$

Taken together, the two regressions give more information
than does the correlation coefficient. When $b_{xy} = 0.378787$ we
know that for every unit advance in I.B. an advance of 0.378787
may be expected in I.Q.; and when $b_{yx} = 2.072111$ we know that
for every unit advance in I.Q. an advance of 2.072111 is to be
expected in I.B. Such forecasting is not possible when only the
correlation coefficient is given. 
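
The calculation of r and of the two regression coefficients from the raw totals can be sketched as follows. This is an illustrative Python fragment, not part of the original; the totals are those quoted for Table 38 and the variable names are introduced here.

```python
import math

# Totals quoted in the text for the 65 children of Table 38.
n = 65
sum_x, sum_y = 6366.0, 6739.0
sum_x2, sum_y2, sum_xy = 638650.0, 781685.0, 691449.0

# Sums of squares and products about the means.
sxx = sum_x2 - sum_x ** 2 / n
syy = sum_y2 - sum_y ** 2 / n
sxy = sum_xy - sum_x * sum_y / n

r = sxy / math.sqrt(sxx * syy)   # correlation coefficient
b_yx = sxy / sxx                 # regression of y (I.B.) on x (I.Q.)
b_xy = sxy / syy                 # regression of x (I.Q.) on y (I.B.)

print(round(sxx, 4), round(syy, 4), round(sxy, 4))
print(round(r, 6), round(b_yx, 6), round(b_xy, 6))
print(round(math.sqrt(b_yx * b_xy), 6))   # check: geometric mean equals r
```

The last line repeats the arithmetic check of the text, r being recovered as the geometric mean of the two regression coefficients.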

With r of moderate or small size and the size of sample,
n, large, r is distributed normally about the parameter $\rho$ of
which it is an estimate. Its variance, $V_r$, is $\frac{(1-\rho^2)^2}{n-1}$ in such
cases. This is not, however, to be recommended as a basis for
the calculation of a test of significance using a normal deviate,
as n is seldom sufficiently large and r may not be small or even
moderate in size. A t test may be employed instead, with
$s_r = \sqrt{\frac{1-r^2}{n-2}}$ as the denominator, t having $n-2$ degrees of
freedom. Thus to test the significance of the deviation of
r from 0, the value expected if x and y are unrelated,

$$t_{[n-2]} = \frac{r}{s_r} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}}$$

In the data of Table 38, $r = 0.885940$ and so $r^2 = 0.784890$,
$1-r^2 = 0.215110$ and $\sqrt{1-r^2} = 0.463800$. Furthermore,

$$\sqrt{n-2} = \sqrt{63} = 7.9373$$

So $t_{[63]} = \frac{r\sqrt{n-2}}{\sqrt{1-r^2}} = \frac{0.8859 \times 7.9373}{0.4638} = 15.162$ and the probability
of obtaining such a value by random sampling of an uncorrelated
population is very small.
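
The t test of a correlation coefficient reduces to a single line of arithmetic. A minimal sketch in Python follows; the function name `correlation_t` is an assumption made for illustration.

```python
import math


def correlation_t(r, n):
    """t for testing the deviation of r from 0; it has n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


t_63 = correlation_t(0.885940, 65)
print(round(t_63, 3))
```

For the data of Table 38 this returns the value of t found above, to be referred to the t distribution with 63 degrees of freedom.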

Fisher has devised another method of testing the significance 
of a correlation coefficient. It involves the use of what is called 
a transformed correlation, as it depends on the properties of 
z where 


$$z = \tfrac{1}{2}\left[\log_e(1+r) - \log_e(1-r)\right]$$

When the two variates x and y are independent in distribution,
$z = r = 0$. When x and y are fully correlated, i.e. $r = \pm 1$, z is very
large. No matter what its value, z is very nearly normally distributed
for all values of n, its standard deviation being $\frac{1}{\sqrt{n-3}}$.
This is the true standard deviation $\sigma_z$, not the estimate $s_z$, as it
is not found from the sum of squares of z in the data. Consequently
the ratio of the deviation of z, from any expected value,
to its standard deviation is a normal deviate, c, not a t, and
may be tested by use of the table of normal deviates.

In the case of Roberts and Griffiths's data $r = 0.8859$. So

$1+r = 1.8859$, $\log_{10}(1+r) = 0.2756$
$1-r = 0.1141$, $\log_{10}(1-r) = \bar{1}.0573$ (that is, $-0.9427$)
$\log_{10}(1+r) - \log_{10}(1-r) = 1.2183$

and

$z = \tfrac{1}{2}[\log_e(1+r) - \log_e(1-r)]$
$= 1.1513\,[\log_{10}(1+r) - \log_{10}(1-r)]$
$= 1.4026$

$n = 65$, so $\sigma_z = \frac{1}{\sqrt{62}} = \frac{1}{7.8740}$

and $c = \frac{z}{\sigma_z} = 1.4026 \times 7.8740 = 11.044$ with a very small probability.
The normal deviate given by z is 11.044 while t, which for
63 degrees of freedom is almost the same as a normal deviate,
is 15.162. The two methods of testing the same hypothesis do
not give exactly the same result. Clearly the two tests are not 
exactly equivalent, but each is testing the hypothesis in its own 
characteristic way. The discrepancy between the two is not 
difficult to explain. The hypothesis has been rendered unlikely 
by both of the tests of significance. Now the two are testing the 
hypothesis in different ways, and so if the hypothesis is untrue 
their connecting link disappears. Hence when the hypothesis is 
rendered unlikely two such tests are not expected to give similar 
results. Where, however, the hypothesis in question is not 
rendered improbable by the tests, the two should give closely 
approximating results. Departure from hypothesis and discrepancy
between valid tests of significance go hand-in-hand.
A different example of this same phenomenon has been recorded 
by Mather in connexion with the $\chi^2$ test of significance.
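
The transformed-correlation test described above can be sketched in a few lines of Python. This is an illustration only; the function names `fisher_z` and `z_normal_deviate` are introduced here, and natural logarithms are used directly in place of the conversion from common logarithms.

```python
import math


def fisher_z(r):
    """Transformed correlation z = (1/2)[log_e(1 + r) - log_e(1 - r)]."""
    return 0.5 * (math.log(1 + r) - math.log(1 - r))


def z_normal_deviate(r, n):
    """Normal deviate c for the deviation of z from 0; sigma_z = 1/sqrt(n - 3)."""
    return fisher_z(r) * math.sqrt(n - 3)


z = fisher_z(0.8859)
c = z_normal_deviate(0.8859, 65)
print(round(z, 4), round(c, 3))
```

For r = 0.8859 and n = 65 the sketch gives z and c in agreement with the values worked out by logarithms in the text.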


41. THE COMBINATION OF INTER-CLASS CORRELATION COEFFICIENTS

The transformed correlation coefficient, z, is of value in
another way, viz. in combining two or more correlation coefficients.

Example 19. Roberts and Griffiths also give data on the
I.Q.s and I.B.s of children born in Bath in the months of January
1923 and January 1924. In the following discussion the 1922
children will be referred to as Group 1, the 1923 children as
Group 2 and the 1924 children as Group 3. Table 39 gives the
correlation coefficients, r, and also the transformed correlations,
z, for the three groups. As z is nearly normally distributed it
may be used for combining these three estimates of $\rho$ to give
the best joint estimate.


TABLE 39

The Combination of Correlation Coefficients (Roberts and Griffiths)

Group of
Children     r         z         n     I (= n-3)    Iz

1            0.8859    1.4026    65    62           86.9612
2            0.9257    1.6274    60    57           92.7618
3            0.8749    1.3537    67    64           86.6368

Total                                  183          266.3598

$\bar{z} = \frac{S(Iz)}{S(I)} = 1.4555$, $\bar{r} = 0.8968$, $V_{\bar{z}} = \frac{1}{S(I)} = \frac{1}{183}$


Now the three groups contain different numbers of individuals 
so that the precisions of the three estimates are different. This 
clearly must affect the procedure of combination, because an 
estimate of greater precision will be of more value in pointing 
to the best estimate. It must be given a greater weight in the 
calculation. This raises the problem of the measure of precision 
of an estimate. 

As the variance of any distribution decreases, the chance of 
finding a deviation from the mean of any given magnitude 
decreases, too. In other words, the precision of our knowledge 
of the distribution has increased and we can predict the behaviour 
of single observations with greater accuracy. So the precision 
must obviously be related to the variance. Indeed, the reciprocal 
of the variance immediately suggests itself as a measure of 
precision. But it is equally true to say that as the standard 
deviation of the distribution decreases, the precision of our 
knowledge of the behaviour of a single observation also increases. 
Thus we might, on the face of it, suppose that the reciprocal of 
the standard deviation also provided a good measure of precision. 
How are we to decide between these two possible measures ? 

Let us consider a specific case, viz. the estimation of a mean. 
The variance of the mean of a distribution is obtained as the 
ratio of the distribution’s variance to the number of observations. 
When the number of observations is doubled the variance of 
their mean is, on the average, halved. The standard error of
the mean, on the other hand, is divided by $\sqrt{2}$. Now if the
reciprocal of the variance is taken as the measure of precision,
the precision value characteristic of the mean $\bar{x}$ based on
$n_1$ observations will be $\frac{n_1}{V_x}$, i.e. the reciprocal of $\frac{V_x}{n_1}$, where $V_x$ is
the variance of the distribution whose mean is $\bar{x}$. The precision
of a mean based on $n_2$ observations is similarly $\frac{n_2}{V_x}$ and that of
the mean of $n_1+n_2$ observations $\frac{n_1+n_2}{V_x}$. The precision of the joint
mean is equal to the sum of the precisions of the two individual
means.

Where, however, precision is measured by the reciprocal of
the standard deviation this simple additive rule does not hold,
as the three precisions would clearly be in the ratio

$$\sqrt{n_1} : \sqrt{n_2} : \sqrt{n_1+n_2}$$

Hence the reciprocal of the variance is a much more convenient
measure. When used in this way the reciprocal of the variance,
or invariance, is called the amount of information and is denoted
by the letter I. The choice of this measure of precision will be
discussed in more detail in relation to the theory of estimation
in a later chapter.

The amounts of information about the three z values of
Table 39 are easily found from the reciprocals of the variances
of each of the z's. $V_z = \frac{1}{n-3}$ and so $I_z = n-3$. These, then, are
the weights to be applied to the z's in calculating the best joint
estimate, which will be found in the form of what is commonly
called a weighted mean of the three values. Each z is multiplied
by its characteristic weight and the three products are summed,
due weight thus being given to each estimate in the joint sum.
Supposing, however, that all the estimates had been identical.
Then the best joint estimate would also be the same as each of
the three individuals. Now if we had arrived at a weighted sum
in the above way it would clearly be necessary to divide the sum
by the sum of the weights in order to recover this best estimate.
This is also the correct procedure in the case where the three
individual estimates are not all alike. Thus the formula for
finding a weighted mean, $\bar{z}$, is

$$\bar{z} = \frac{S(Iz)}{S(I)}$$

In our case $Iz = (n-3)z$ and

$S(Iz) = (62 \times 1.4026) + (57 \times 1.6274) + (64 \times 1.3537) = 266.3598$

$S(I) = 62 + 57 + 64 = 183$

$$\bar{z} = \frac{266.3598}{183} = 1.4555$$

and

$$\bar{r} = \frac{e^{2\bar{z}}-1}{e^{2\bar{z}}+1} = 0.8968$$

The variance of $\bar{z}$ is also found from the amounts of information.
It will be recalled from our discussion of the precision of
a mean that information, as defined in our present way, is additive.

Hence $I_{\bar{z}} = S(I) = 183$ and so $V_{\bar{z}} = \frac{1}{183}$
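
The weighting procedure may be sketched as follows. This Python fragment is an illustration under the stated rule that each z carries information n - 3; the function name `combine_correlations` is introduced here, and `math.atanh` and `math.tanh` are used as equivalents of the transformation and its inverse.

```python
import math


def combine_correlations(groups):
    """Weighted mean of transformed correlations.

    groups is a list of (r, n) pairs; each z has weight I = n - 3.
    Returns the weighted mean z, the corresponding r, and the variance of
    the weighted mean.
    """
    weights = [n - 3 for _, n in groups]
    zs = [math.atanh(r) for r, _ in groups]   # z = (1/2) log_e[(1+r)/(1-r)]
    total_information = sum(weights)
    z_bar = sum(w * z for w, z in zip(weights, zs)) / total_information
    r_bar = math.tanh(z_bar)                  # (e^{2z} - 1) / (e^{2z} + 1)
    return z_bar, r_bar, 1 / total_information


# r and n for the three groups of Table 39.
z_bar, r_bar, var_z_bar = combine_correlations(
    [(0.8859, 65), (0.9257, 60), (0.8749, 67)]
)
print(round(z_bar, 4), round(r_bar, 4), var_z_bar)
```

Small differences in the last decimal place from the printed figures arise only because the text rounds each z to four places before weighting.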


42. PARTIAL CORRELATION 

It has already been shown that correlation is a modification 
of linear regression and in consequence we may reasonably 
expect that any operation involving linear regression analysis 
will have a counterpart in the correlation method. So we should 
anticipate that multiple, or, as it is often called, partial regression, 
can be reproduced in the correlation technique, and indeed this 
is so. The method in question is termed partial correlation. 

Suppose that we have taken three measurements w, x and y,
on each of n individuals. A correlation coefficient relating any
pair can be found. There will be three such quantities, $r_{xy}$, $r_{xw}$
and $r_{yw}$. From these three the partial correlation coefficient,
$r_{xy.w}$, which measures the correlation between x and y when due
allowance is made for the effect of w, can be found by using the
formula

$$r_{xy.w} = \frac{r_{xy} - r_{xw}r_{yw}}{\sqrt{(1-r_{xw}^2)(1-r_{yw}^2)}}$$

The method can be extended to cases involving more than 
three variables, but the arithmetic labour increases enormously 
as the number of variates becomes larger. 

The significance of partial correlation is tested in exactly the 
same way as that used for simple correlation coefficients, but with 
an additional degree of freedom deducted for every variate 
eliminated. In the example with three variates, considered
above, one degree of freedom would be lost when the variate w was
eliminated. Thus if there were n observations, t testing $r_{xy.w}$
would take $n-3$ degrees of freedom rather than $n-2$ as used in
the t test of a simple coefficient. When using the transformed
correlation, z, the variance of $z_{xy.w}$ would be $\frac{1}{n-4}$, not $\frac{1}{n-3}$, as in
the case of $z_{xy}$.

Data of this kind are, however, more profitably treated by 
the use of multiple regression. 
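
For three variates the partial correlation and its t test can be sketched as below. The fragment is illustrative only: the function names are introduced here and the three simple correlations and the sample size are hypothetical figures, not data from the text.

```python
import math


def partial_correlation(r_xy, r_xw, r_yw):
    """Correlation of x and y with allowance made for w."""
    return (r_xy - r_xw * r_yw) / math.sqrt((1 - r_xw ** 2) * (1 - r_yw ** 2))


def partial_t(r_partial, n, eliminated=1):
    """t for a partial correlation; one degree of freedom is deducted for
    every variate eliminated, leaving n - 2 - eliminated in all."""
    df = n - 2 - eliminated
    return r_partial * math.sqrt(df) / math.sqrt(1 - r_partial ** 2), df


# Hypothetical simple correlations among x, y and w in a sample of 28.
r_xy_w = partial_correlation(r_xy=0.5, r_xw=0.5, r_yw=0.5)
t_value, df = partial_t(r_xy_w, n=28)
print(round(r_xy_w, 4), round(t_value, 4), df)
```

With all three simple correlations equal to 0.5 the partial correlation falls to one third, showing how much of the apparent relation between x and y may be due to their common dependence on w.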

43. INTRA-CLASS CORRELATION

In the example of a correlation coefficient that was considered 
in Section 40 the two variates, I.Q. and I.B., obtained from 
intelligence tests, differed quite markedly from one another. 
Their means were similar but their variances were very different. 

It does, however, happen at times that the data requiring 
analysis concern the joint distribution of two like quantities. 
Such a case could be, for example, a set of measurements on 
pairs of fully grown brothers. If these were always separable 
into, say, elder and younger it would be possible to treat them 
by the correlation method already described; but if such 
a separation were not possible a different type of correlation 
coefficient, denoted as intra-class, would be necessary.

The intra-class correlation coefficient differs from the inter- 
class type in that during its calculation the two variates, x and y, 
are assumed to have the same mean and variance. No other 
assumption is possible if the x and y measurements, as pairs, are 
not separable. Even where separation is possible, the intra-class
correlation is the more precise method, provided, of course, that 
the two measurements are of the same kind and might be expected 
to follow the same distribution. 






The labour involved in such a calculation becomes very great 
as the number of observations in each set increases, but an easier 
approach to the analysis is possible. This is afforded by the 
analysis of variance, which incidentally also shows why the 
z transformation may be used in testing the significance of
correlation coefficients. 

Let us consider once more the case of n pairs of measurements, 
each pair including an x and an x' measurement. There are 
2n measurements in all and hence 2n-1 degrees of freedom between
them. These may be partitioned into three groups. First,
there will be 1 degree of freedom for the difference between the
means of x and x'. Secondly, there will be n-1 degrees of
freedom for differences between the pairs of observations.
Finally, there will be n-1 degrees of freedom for the variation
of the difference (x-x') between the various pairs of observations.
The analysis is, in fact, exactly the same as that used in the 
tomato example of Section 24. 

If x and x' were separable, as would be the case with inter-class
correlation data, the analysis could be made completely.
If, however, x and x' are not separable, as in the present case,
the analysis must be incomplete and the item for the difference
between x and x' pooled with the third portion of the variance,
viz. that concerning variation in the difference between x and x' in
the different pairs of observations. In both complete and incomplete
analyses the portion of the variance ascribable to
differences between pairs of observations is capable of being
isolated.

When x and x' are independent

$$V_{x+x'}=V_x+V_{x'}$$

but if x and x' are correlated this relation is modified by the
inclusion of a covariance term which reduces the value of $V_{x+x'}$
when the correlation is negative and increases it when the correlation
is positive. So the value of the mean square between pairs
of observations will vary according to the type of correlation
between x and x'. If it does not differ significantly from the
error term, based on the variation of (x-x'), there is no evidence
of correlation. If it is significantly higher than the error mean
square the correlation is positive. If it is lower than the error
mean square the correlation is negative.

The test of significance consists, of course, in the calculation 
of a z which is half the natural logarithm of the ratio of the two
mean squares. The use of z as a transformation of the correlation 
coefficient for the purpose of testing significance is no longer 
mysterious. 

The essential difference between inter- and intra-class correlation
when viewed from the standpoint of the analysis of variance
lies in the completeness of the partition in the analysis to which
they lead. It must be added, however, that the analysis of 
variance may not always be applicable to inter-class correlation 
problems because marked differences in distribution of the two 
variates may be found. It is always appropriate to intra-class 
correlation analyses, which are based on the assumption that no 
such distributional differences exist. 

We have already seen that the analysis of variance is a very 
general method and so it is clear that groups of more than two 
observations can be handled in this way. The following data
from Mather and Lamm's account of the frequency of chiasma
formation in rye chromosomes show an application of the
analysis to a case where each group consisted of seven observations.

Example 20. The majority of rye plants have seven pairs of 
chromosomes, i.e. bivalents, at meiosis, each of them forming 
from one to four chiasmata. The seven bivalents cannot be 
regularly distinguished from one another by inspection, and so 
the distribution of chiasma frequencies for any single bivalent 
cannot be found. It may, however, be assumed that all the 
bivalents have the same chiasma characteristics, whereupon the 
distribution of chiasma frequencies can be found by counting 
the chiasmata in each of the seven bivalents and treating each 
bivalent as a sample observation of the joint chiasma frequency 
distribution. Table 40 gives the results of such counts on the 
seven bivalents in each of 35 nuclei of a particular rye plant. 

TABLE 40

The Frequency of Chiasma Formation in a Rye Plant (Mather and Lamm)

Number of chiasmata per bivalent      1     2     3     4          Total
Number of bivalents                   4   150    89     2            245

Number of chiasmata per nucleus      14    15    16    17    18    Total
Number of nuclei                      1     3    12    14     5       35


In addition, the total number of chiasmata in each of the 
35 nuclei was determined, it being, of course, the sum of the 
chiasmata in the seven constituent bivalents. These nuclear 
totals are also given in Table 40. The existence of a correlation 
between the numbers of chiasmata in the various bivalents of 
the same nucleus can be tested by an analysis of variance of 
these chiasma frequencies. 

There are 245 observations in all, and so the total of degrees
of freedom is 244. The corresponding total sum of squares of
deviations from the general mean is found by summing the squares
of the numbers of chiasmata and subtracting the correction
term, which will clearly be $579^2/245$. The calculation is thus

$$[4\times(1^2)+150\times(2^2)+89\times(3^2)+2\times(4^2)]-\frac{579^2}{245}=1{,}437.0000-1{,}368.3306=68.6694$$

This total must be partitioned into two items, one for differences 
between nuclei and one for differences within nuclei. The former 
will have 34 degrees of freedom, as there are 35 nuclei, leaving 
210 for the latter item. 

The sum of squares between nuclei is obtainable as

$$\tfrac{1}{7}[1\times(14^2)+3\times(15^2)+12\times(16^2)+14\times(17^2)+5\times(18^2)]-\frac{579^2}{245}=1{,}372.7143-1{,}368.3306=4.3837$$

The divisor 7 is used because each nuclear total is the sum of
seven individual chiasma frequencies. The sum of squares
within nuclei is then obtainable by subtraction, giving as the
analysis:

TABLE 41

Item               Sum of squares      N     Mean square
Between nuclei          4.3837         34      0.1289
Within nuclei          64.2857        210      0.3061
Total                  68.6694        244

Variance ratio 2.3747, with a probability of about 0.001


The seven bivalents of the nucleus are not distinguishable, so
that the analysis of variance is of necessity incomplete. The
sum of squares within nuclei contains an item, for 6 degrees of
freedom, dependent on differences between the chiasma
frequencies of the seven bivalents.

The significance of the difference between the two items which
are actually isolated can be tested by calculating the variance
ratio $0.3061/0.1289=2.3747$. The larger mean square has 210 degrees
of freedom and the smaller 34, so that the variance ratio table is
entered with $N_1=210$ and $N_2=34$. The probability is found to
be about 0.001, so indicating a genuine discrepancy in size of the
two mean squares. The same test could, of course, be
equally well performed by taking

$$z=\tfrac{1}{2}\log_e(\text{V.R.})=1.1513\log_{10}(\text{V.R.})=1.1513\log_{10}2.3747=0.4324$$

which, on entering in a table of z with $N_1=210$ and $N_2=34$, gives
exactly the same probability as that found when the variance
ratio itself was used.
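
The arithmetic of Example 20 can be retraced with a short script. This is only a sketch: the variable names are illustrative and not the author's, and the frequencies are those of Table 40.

```python
import math

# Frequencies from Table 40 (value -> number of times observed)
bivalents = {1: 4, 2: 150, 3: 89, 4: 2}          # chiasmata per bivalent
nuclei = {14: 1, 15: 3, 16: 12, 17: 14, 18: 5}   # chiasmata per nucleus
k = 7                                            # bivalents per nucleus

n_obs = sum(bivalents.values())                          # 245 observations
grand_total = sum(v * f for v, f in bivalents.items())   # 579 chiasmata
correction = grand_total ** 2 / n_obs                    # correction term

# Sums of squares: total, between nuclei, and within nuclei by subtraction
ss_total = sum(f * v ** 2 for v, f in bivalents.items()) - correction
ss_between = sum(f * t ** 2 for t, f in nuclei.items()) / k - correction
ss_within = ss_total - ss_between

df_between = sum(nuclei.values()) - 1
df_within = (n_obs - 1) - df_between
ms_between = ss_between / df_between
ms_within = ss_within / df_within

# Larger mean square over smaller, and the equivalent z
variance_ratio = ms_within / ms_between
z = 0.5 * math.log(variance_ratio)

print(round(ss_total, 4), round(ss_between, 4), round(ss_within, 4))
print(round(variance_ratio, 4), round(z, 4))
```

Working from unrounded mean squares gives a variance ratio of about 2.374, which differs from the 2.3747 of the text only because the text divides the mean squares after rounding them to four places.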





The fact that the difference between the two mean squares 
is significant shows that a correlation in chiasma frequency exists 
between the seven bivalents of a nucleus. Furthermore, this
correlation must be negative as the mean square between nuclei 
is lower than that within nuclei. 

The intra-class correlation can be calculated from the data of
Table 41, as it can be shown that the total sum of squares equals
$nkV_x$, and the sum of squares between nuclei is $nV_x[1+(k-1)r]$,
where there are n sets (here 35) each of k observations (here 7),
r being the correlation coefficient. Then the ratio of the sum of
squares between nuclei to the total sum of squares must be
$\frac{1+(k-1)r}{k}$. Arithmetically, from Table 41, this gives

$$\frac{1+6r}{7}=\frac{4.3837}{68.6694}=0.06384$$

and $r=\tfrac{7}{6}(0.06384-0.1429)=-0.09219$.

This calculation also brings out one final property distinguishing
intra- from inter-class coefficients. The latter may range, as
we have seen, from -1 to +1. An intra-class correlation also has
the upper limit of 1, but the lower limit is easily shown to be
$-\frac{1}{k-1}$, attained when the sum of squares between nuclei, i.e.
$nV_x[1+(k-1)r]$, takes its minimum value of 0. In this case
$\frac{1+(k-1)r}{k}=0$ and $(k-1)r=-1$. So even when a negative intra-class
coefficient is found to exist, its actual value is not of much use
in determining the strength of the relation of which the correlation
is a reflexion.
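
As a rough check on the last two steps, the relation between the sums of squares and r can be solved directly. The function name below is an assumption made for illustration; the inputs are the rounded sums of squares of Table 41.

```python
def intraclass_r(ss_between, ss_total, k):
    """Solve ss_between / ss_total = (1 + (k - 1) r) / k for r."""
    ratio = ss_between / ss_total
    return (k * ratio - 1) / (k - 1)

# Rye data: 35 nuclei, each with k = 7 bivalents
r = intraclass_r(4.3837, 68.6694, 7)

# The lower limit is reached when the sum of squares between sets is zero
r_min = intraclass_r(0.0, 68.6694, 7)

print(round(r, 5), round(r_min, 5))
```

With seven observations per set the coefficient cannot fall below -1/6, so the observed value of about -0.092 is already more than half-way to its limit.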


REFERENCES 

FISHER, R. A. 1944. Statistical Methods for Research Workers. Oliver
and Boyd. Edinburgh. 9th ed.

MATHER, K. 1940. The design and significance of synergic action tests.
J. Hyg., 40, 613-31.

MATHER, K., and LAMM, B. 1936. The negative correlation of chiasma
frequencies. Hereditas, 20, 66-70.

ROBERTS, J. A. FRASER, and GRIFFITHS, R. 1937. Studies on a child
population. Ann. Eugen., 8, 15-46.





CHAPTER XI 


THE ANALYSIS OF FREQUENCY DATA 

44. $\chi^2$ AND THE NORMAL DEVIATE

THE data resulting from observations or experiments are
commonly of two types, which may for convenience be termed
measurements and frequencies. In the former type each individual
is characterized by a measurement of some kind. Most
of the data so far considered have been of this type. Thus in 
Example 12 each observation gave the amount of Rb taken up 
by potato slices in a given time, and in Example 6 each tomato 
plant was represented by its yield in the data for analysis. In 
such cases the distributions of the various measurements must be 
reconstructed in a suitable manner from the experimental results 
themselves. This means that the variances used in the final 
tests of significance are estimated from observations. The 
t and z distributions are, as we have seen, appropriate to analyses 
of this kind. We may note that the maize data of Example 1 
were of this type, though they had been recast into the form of 
a frequency distribution in order to lighten the arithmetic 
computation. 

In other types of biological research the results of observation
and experiment take the form of frequencies with which individuals
fall into certain distinct classes. This is commonly the
case with genetical experiments, such as that of Sirks discussed in
Example 5, where two classes of plant, with coloured and white
flowers, were recognized. With frequency data of this kind the
hypothesis under test fixes the variance of the distribution
expected and in consequence the test of significance will be
made using a normal deviate or $\chi^2$.

Both the normal deviate and $\chi^2$ are found by comparing
observed deviations, or squared deviations, with standard deviations,
or variances, fixed by hypothesis. As has been pointed out
in Chapter IV, they represent limits to t and z, in which quantities
the variance or standard deviation forming the denominator is
estimated from the data. The normal deviate may also be
regarded as a special case of $\chi^2$ in that its numerator consists of
a single deviation while that of $\chi^2$ may depend on any number of
independent comparisons. This relation can perhaps be best
seen from a further consideration of Sirks's results.

One cautionary remark must be made here. Both $\chi^2$ and
the normal deviate are derived from the continuous normal
distribution, whereas frequency data are discontinuous. In
consequence an error is introduced, but it will not be large unless the
expected class frequencies are small. Neither $\chi^2$ nor the normal
deviate should ever be used where any class frequency has an
expectation of 5 or less. Yates's correction for continuity
(Section 19) helps to overcome this trouble, but the above rule
is a sound safeguard against the serious overestimation of
significance.

Example 21. It will be recalled from Example 5 that on self-pollinating
a Datura plant heterozygous for the gene P,p, Sirks
obtained a family containing 59 coloured (P) and 14 white (p)
flowered plants. Mendelian theory leads us to expect that, on
the average, 3/4 of the individuals would be coloured and the
remaining 1/4 white; but it also tells us, and this is an important
point, that the distribution of the observed frequencies of coloured
individuals in such families will have the characteristic binomial
variance of $p(1-p)n$, where p and (1-p) are the probabilities that
a given individual will be coloured or white, i.e. 3/4 and 1/4
respectively, and n is the number of individuals in the family, in this
case 73.

So we expect to find $\tfrac34\times73$, i.e. 54.75, coloured plants and
$\tfrac14\times73$, i.e. 18.25, white plants in such a family. Furthermore,
the variance of the number of coloureds or whites is expected
to be $\tfrac34\times\tfrac14\times73$, i.e. 13.6875.

The deviation of the observed frequency of coloureds from
that expected is 59 - 54.75 or 4.25. This may be compared with
the standard deviation, found as the square root of the variance,
to give a normal deviate in the form

$$\frac{4.25}{\sqrt{13.6875}}=\frac{4.25}{3.6997}=1.149.$$

Consultation of a table of normal deviates (Table I) shows that
the probability of finding a fit as bad or worse is between 0.26
and 0.25.

The same result could be obtained equally well using $\chi^2$.
For this calculation it is necessary to compare the square of the
observed deviation with the expected variance, i.e.

$$\chi^2_{[1]}=\frac{(4.25)^2}{13.6875}=\frac{18.0625}{13.6875}=1.3196=(1.149)^2$$



This $\chi^2$ will have 1 degree of freedom because the numerator is
based on one comparison between the data. A normal deviate
is thus the square root of a $\chi^2$ for 1 degree of freedom, a relation
which is exactly the same as that holding between t and the
variance ratio. The table of $\chi^2$ (Table III) shows that the
probability of finding a fit with hypothesis at least as bad as
that observed lies between 0.30 and 0.20. The normal deviate
permitted the probability to be stated with greater accuracy
because the table of normal deviates, being one-dimensional, is
more closely computed than the two-dimensional $\chi^2$ table. This
extra precision is, however, seldom of real importance in
understanding the implications of experimental data.

It may be noted that Yates's correction for continuity could
have been applied in the calculation of $\chi^2$ in exactly the same
way as it was used in Example 5.
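
The normal deviate, the equivalent $\chi^2$ and the effect of Yates's correction can be sketched together as follows. The function name and its `yates` flag are assumptions for illustration; the two-tailed probability is taken from the normal integral through `math.erfc` rather than from a printed table.

```python
import math

def two_class_test(a1, a2, m1, yates=False):
    """Normal deviate, chi-square (1 d.f.) and two-tailed probability
    for an observed two-class segregation a1 : a2 against m1 : (1 - m1)."""
    n = a1 + a2
    m2 = 1 - m1
    deviation = a1 - m1 * n
    if yates:
        # reduce the absolute deviation by one half, keeping its sign
        deviation = math.copysign(max(abs(deviation) - 0.5, 0.0), deviation)
    variance = m1 * m2 * n                  # fixed by hypothesis
    deviate = deviation / math.sqrt(variance)
    chi2 = deviate ** 2
    p = math.erfc(abs(deviate) / math.sqrt(2))
    return deviate, chi2, p

# Sirks's Datura family: 59 coloured, 14 white, 3 : 1 expected
deviate, chi2, p = two_class_test(59, 14, 0.75)
_, chi2_corrected, p_corrected = two_class_test(59, 14, 0.75, yates=True)

print(round(deviate, 3), round(chi2, 4), round(p, 3))
print(round(chi2_corrected, 4), round(p_corrected, 3))
```

The uncorrected probability falls between 0.25 and 0.26, in agreement with the table of normal deviates; the corrected $\chi^2$ is smaller, as the correction always makes it.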


45. THE VARIOUS FORMS OF $\chi^2$

The normal deviate is confined to cases like the one above
where only one observed deviation is to be tested, but $\chi^2$ is of
wider use when its general form is known.

Where only a single deviation is to be tested, $\chi^2$, as we have
seen above, can be found from the formula

$$\chi^2_{[1]}=\frac{(a_1-m_1n)^2}{m_1m_2n}=\frac{(a_2-m_2n)^2}{m_1m_2n}$$

where n is the family size, $m_1$ and $m_2$ the relative proportions
expected to fall into the two classes, and $a_1$ and $a_2$ the numbers
observed in the two classes. It should be noted that $m_1+m_2=1$
and $a_1+a_2=n$. The symbols $m_1$ and $m_2$ replace p and (1-p) for
reasons which will become apparent when more complicated data
are under discussion. In Sirks's case $a_1=59$, $a_2=14$, $n=a_1+a_2=73$,
$m_1=\tfrac34$, $m_2=\tfrac14$ and $m_1+m_2=1$. Then

$$\chi^2_{[1]}=\frac{(a_1-m_1n)^2}{m_1m_2n}=\frac{(59-\tfrac34\times73)^2}{\tfrac34\times\tfrac14\times73}$$

as already used.

This formula for $\chi^2$ is incapable of extension to more complex
cases, but it can be recast into a form from which the general
formula becomes obvious.

$$\chi^2_{[1]}=\frac{(a_1-m_1n)^2}{m_1m_2n}=\frac{(a_1-m_1a_1-m_1a_2)^2}{m_1m_2n}\quad\text{since } n=a_1+a_2$$

$$=\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n}\;{}^{*}\quad\text{since } 1-m_1=m_2$$

We may pause here to note that this formula or its variant

$$\chi^2_{[1]}=\frac{(a_1-la_2)^2}{ln}\quad\text{where } l=\frac{m_1}{m_2}$$

is very convenient for the calculation of $\chi^2$, testing the fit of
an observed two-class segregation with the expected ratio
$m_1 : m_2$, or $l : 1$ where $l=m_1/m_2$. It is, as will be seen later,
widely used in analyses of this kind.

* Yates's correction for continuity results in the value, which is squared
to give the numerator of this fraction, being reduced by 1/2 if positive or
by -1/2 if negative, i.e. it is $(m_2a_1-m_1a_2-\tfrac12)$ if positive, or
$(m_2a_1-m_1a_2+\tfrac12)$ if negative.

Resuming the general discussion, a further transformation
may be made:

$$\chi^2_{[1]}=\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n}=\frac{1}{m_1m_2n}(m_2^2a_1^2-2m_1m_2a_1a_2+m_1^2a_2^2)$$

Since $1+m_1^2+m_1m_2-2m_1=m_2$,

$1+m_1m_2+m_2^2-2m_2=m_1$

and $m_1+m_2-2=-1$,

$$\chi^2_{[1]}=\frac{1}{m_1m_2n}[m_2a_1^2(1+m_1^2+m_1m_2-2m_1)+m_1a_2^2(1+m_1m_2+m_2^2-2m_2)+2m_1m_2a_1a_2(m_1+m_2-2)]$$

By expansion and rearrangement

$$=\frac{1}{m_1m_2n}[m_2a_1^2-2m_1m_2a_1^2-2m_1m_2a_1a_2+m_1^2m_2a_1^2+2m_1^2m_2a_1a_2+m_1^2m_2a_2^2$$
$$\qquad+m_1a_2^2-2m_1m_2a_2^2-2m_1m_2a_1a_2+m_1m_2^2a_1^2+2m_1m_2^2a_1a_2+m_1m_2^2a_2^2]$$

(since $a_1+a_2=n$)

$$=\frac{1}{m_1m_2n}[m_2a_1^2-2m_1m_2na_1+m_1^2m_2n^2+m_1a_2^2-2m_1m_2na_2+m_1m_2^2n^2]$$

$$=\frac{1}{m_1m_2n}[m_2(a_1-m_1n)^2+m_1(a_2-m_2n)^2]=\frac{(a_1-m_1n)^2}{m_1n}+\frac{(a_2-m_2n)^2}{m_2n}$$

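This formula has two terms each dependent on one of the
two classes into which the family was divided. Applying it to
Sirks's data we find:

$$\chi^2_{[1]}=\frac{(59-\tfrac34\times73)^2}{\tfrac34\times73}+\frac{(14-\tfrac14\times73)^2}{\tfrac14\times73}=1.3196\ \text{as before}$$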


It is easy to see that this formula can be extended to any
number of classes in the form

$$\chi^2=\frac{(a_1-m_1n)^2}{m_1n}+\frac{(a_2-m_2n)^2}{m_2n}+\dots+\frac{(a_j-m_jn)^2}{m_jn}$$

where $a_1+a_2+\dots+a_j=n$ and $m_1+m_2+\dots+m_j=1$,
or more compactly

$$\chi^2=S\left[\frac{(a-mn)^2}{mn}\right]$$

This formula in its turn may be re-written in a form more
convenient for calculation as

$$\chi^2=S\left(\frac{a^2}{mn}\right)-n$$
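
That the deviation form and the computing form, $S(a^2/mn)-n$, give the same value is easily confirmed numerically. The sketch below uses hypothetical function names and applies both forms to Sirks's two-class data.

```python
def chi2_deviation_form(observed, proportions):
    """Sum over classes of (a - mn)^2 / mn."""
    n = sum(observed)
    return sum((a - m * n) ** 2 / (m * n)
               for a, m in zip(observed, proportions))

def chi2_computing_form(observed, proportions):
    """Sum over classes of a^2 / mn, less n."""
    n = sum(observed)
    return sum(a ** 2 / (m * n)
               for a, m in zip(observed, proportions)) - n

observed = [59, 14]          # Sirks's coloured and white plants
proportions = [0.75, 0.25]   # 3 : 1 expectation

long_form = chi2_deviation_form(observed, proportions)
short_form = chi2_computing_form(observed, proportions)
print(round(long_form, 4), round(short_form, 4))
```

The second form needs no deviations to be found, which is why it is the more convenient when the classes are many.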



46. PARTITIONING $\chi^2$

The use of these formulae allows a $\chi^2$ to be calculated to test
the fit with expectation of any number of classes in a set of data.
It is, however, clear that the value of $\chi^2$ will not be independent
of the number of classes concerned. It will, in fact, have a
characteristic number of degrees of freedom, as the following
considerations show.

We have seen that a $\chi^2$ testing the fit of two observed frequencies
$a_1$ and $a_2$ with the expected values, $m_1n$ and $m_2n$, can
be written in the form

$$\chi^2_{[1]}=\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n}$$

This is obviously dependent on one comparison between the
observed frequencies, viz. $(m_2a_1-m_1a_2)$, and so has 1 degree of
freedom.

The next simplest case is that of individuals falling into three
classes. Let the expected proportions be $m_1$, $m_2$ and $m_3$, so that
$m_1+m_2+m_3=1$, and the numbers observed be correspondingly
$a_1$, $a_2$ and $a_3$, the total being n. Applying the general formula

$$\chi^2=\frac{a_1^2}{m_1n}+\frac{a_2^2}{m_2n}+\frac{a_3^2}{m_3n}-n$$

But suppose we pool classes 1 and 2 to give a two-class segregation
having the observed frequencies $(a_1+a_2) : a_3$ and expectations
$(m_1+m_2)n : m_3n$. A $\chi^2$ for 1 degree of freedom testing the agreement
of observations and expectation in this new two-class
segregation can be calculated from the formula

$$\chi^2_{[1]}=\frac{[m_3(a_1+a_2)-(m_1+m_2)a_3]^2}{(m_1+m_2)m_3n}$$

The derivation of this from the general two-class formula requires
no explanation.





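The formula for the total $\chi^2$ must include this item and also
another which may reasonably be supposed to be concerned with
the separation of classes 1 and 2. By subtraction, this second
portion is

$$\left(\frac{a_1^2}{m_1n}+\frac{a_2^2}{m_2n}+\frac{a_3^2}{m_3n}-n\right)-\frac{[m_3(a_1+a_2)-(m_1+m_2)a_3]^2}{m_3(m_1+m_2)n}$$

(since $a_1+a_2+a_3=n$)

$$=\frac{1}{m_1m_2m_3n}[a_1^2m_2m_3+a_2^2m_1m_3+a_3^2m_1m_2-(a_1+a_2+a_3)^2m_1m_2m_3]$$
$$\qquad-\frac{1}{m_3(m_1+m_2)n}[(a_1+a_2)^2m_3^2-2(a_1+a_2)a_3(m_1+m_2)m_3+a_3^2(m_1+m_2)^2]$$

$$=\frac{1}{m_1m_2m_3(m_1+m_2)n}[a_1^2m_2m_3(1-m_1)(m_1+m_2)+a_2^2m_1m_3(1-m_2)(m_1+m_2)$$
$$\qquad+a_3^2m_1m_2(1-m_3)(m_1+m_2)-a_1^2m_1m_2m_3^2-a_2^2m_1m_2m_3^2-a_3^2m_1m_2(m_1+m_2)^2$$
$$\qquad-2(a_1a_2+a_1a_3+a_2a_3)m_1m_2m_3(m_1+m_2)$$
$$\qquad-2a_1a_2m_1m_2m_3^2+2a_1a_3m_1m_2m_3(m_1+m_2)+2a_2a_3m_1m_2m_3(m_1+m_2)]$$

$$=\frac{1}{m_1m_2m_3(m_1+m_2)n}[a_1^2m_2m_3(m_1m_2+m_2^2+m_1m_3+m_2m_3-m_1m_3)$$
$$\qquad+a_2^2m_1m_3(m_1^2+m_1m_3+m_1m_2+m_2m_3-m_2m_3)$$
$$\qquad-2a_1a_2(m_1^2m_2m_3+m_1m_2^2m_3+m_1m_2m_3^2)]$$

(since $m_1+m_2+m_3=1$)

$$=\frac{1}{m_1m_2m_3(m_1+m_2)n}[a_1^2m_2^2m_3-2a_1a_2m_1m_2m_3+a_2^2m_1^2m_3]$$

$$=\frac{(a_1m_2-a_2m_1)^2}{m_1m_2(m_1+m_2)n}=\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n'}$$

where $n'=n(m_1+m_2)$, i.e. is the total of individuals expected to fall
into either class 1 or class 2.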

This is clearly the formula of a $\chi^2$ for 1 degree of freedom
based solely on the comparison between classes 1 and 2. The
only point requiring special note is that n', the effective total
number of observations in classes 1 and 2 jointly, is the total
number expected, not the total number observed.

So we see that the $\chi^2$ calculated from a three-class segregation
is divisible into two independent $\chi^2$'s each for 1 degree of freedom.
The total $\chi^2$ is thus based on two independent comparisons and
has 2 degrees of freedom.
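
The identity reached by subtraction can be checked with exact fractions. The counts and expected proportions below are hypothetical, chosen only to illustrate the partition, and the function names are likewise not the author's.

```python
from fractions import Fraction as F

def total_chi2(a, m):
    """Compound chi-square for three classes: S(a^2 / mn) - n."""
    n = sum(a)
    return sum(F(x) ** 2 / (mi * n) for x, mi in zip(a, m)) - n

def pooled_component(a, m):
    """Classes 1 and 2 pooled and compared with class 3."""
    a1, a2, a3 = a
    m1, m2, m3 = m
    n = sum(a)
    return (m3 * (a1 + a2) - (m1 + m2) * a3) ** 2 / ((m1 + m2) * m3 * n)

def within_component(a, m):
    """Class 1 compared with class 2, with n' = n(m1 + m2)."""
    a1, a2, _ = a
    m1, m2, _ = m
    n = sum(a)
    return (m2 * a1 - m1 * a2) ** 2 / (m1 * m2 * n * (m1 + m2))

# Hypothetical three-class family with a 1 : 2 : 1 expectation
a = (30, 40, 30)
m = (F(1, 4), F(1, 2), F(1, 4))

total = total_chi2(a, m)
parts = (pooled_component(a, m), within_component(a, m))
print(total, parts, sum(parts) == total)
```

Because the arithmetic is exact, the two components are seen to add to the compound $\chi^2$ without any rounding difference.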

The subdivision of the total $\chi^2$ could be made in two other
ways according to whether classes 1 and 3 or classes 2 and 3 are
pooled for the purpose of analysis. The three possible partitions
are:

(i) $\dfrac{(m_2a_1-m_1a_2)^2}{m_1m_2n(m_1+m_2)}$ and $\dfrac{[m_3(a_1+a_2)-(m_1+m_2)a_3]^2}{(m_1+m_2)m_3n}$

(ii) $\dfrac{(m_3a_1-m_1a_3)^2}{m_1m_3n(m_1+m_3)}$ and $\dfrac{[m_2(a_1+a_3)-(m_1+m_3)a_2]^2}{(m_1+m_3)m_2n}$

(iii) $\dfrac{(m_3a_2-m_2a_3)^2}{m_2m_3n(m_2+m_3)}$ and $\dfrac{[m_1(a_2+a_3)-(m_2+m_3)a_1]^2}{(m_2+m_3)m_1n}$

There is thus scope for analysis of $\chi^2$ into components testing
those specific comparisons which the hypothesis in question
suggests, exactly as was the case with sums of squares in
Section 22.

The same type of analysis by subtraction can be applied to
the case of a four-class segregation, and it can be shown that the
total $\chi^2$ has 3 degrees of freedom, being subdivisible into three
independent components each based on a single comparison and
hence each having one degree of freedom. This partition can
be made in a number of ways of which two examples are

$$\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n(m_1+m_2)},\qquad\frac{[m_3(a_1+a_2)-(m_1+m_2)a_3]^2}{m_3(m_1+m_2)n(m_1+m_2+m_3)}$$

and

$$\frac{[m_4(a_1+a_2+a_3)-(m_1+m_2+m_3)a_4]^2}{m_4(m_1+m_2+m_3)n}$$

or alternatively

$$\frac{(m_2a_1-m_1a_2)^2}{m_1m_2n(m_1+m_2)},\qquad\frac{(m_4a_3-m_3a_4)^2}{m_3m_4n(m_3+m_4)}$$

and

$$\frac{[(m_3+m_4)(a_1+a_2)-(m_1+m_2)(a_3+a_4)]^2}{(m_1+m_2)(m_3+m_4)n}$$

A third type of partition of a 3 degree of freedom $\chi^2$ which is
much used in genetical analysis will be developed later.

It will be seen from the foregoing that a compound $\chi^2$ can
be partitioned into simple components each dependent on a single
comparison and each taking 1 degree of freedom, in the same
way that sums of squares are analysed. As with sums of squares,
too, the single $\chi^2$'s are found from a comparison which is squared
and divided by a characteristic divisor. With sums of squares
(Section 23) the conditions which must be fulfilled for a successful
partition are:

(i) In the comparison, $S(k)=0$

(ii) The divisor is $S(k^2)$

(iii) The comparisons are independent if $S(k_1k_2)=0$

Partitioning $\chi^2$ is a somewhat more complex operation because
the frequencies which are to be used in making the comparisons
are not always expected to be equal, whereas in the analysis of
variance all the measurements have the same expectation. The
complication introduced by this inequality of expectation is not,
however, very great. The three conditions necessary for a
successful partition are modified to

(i) In the comparison, $S(mk)=0$

(ii) The divisor is $nS(mk^2)$

(iii) The comparisons are independent if $S(mk_1k_2)=0$

As to the divisor, it may be noted that since $\chi^2$ compares an
observed deviation with the theoretical variance, $nS(mk^2)$ is
clearly $V_x$, where $x=S(ak)$. This comparison vanishes when
a takes its expected value of mn.

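Applying these three tests to the partition of a 2 degree of
freedom $\chi^2$ into the two simple components

(i) $\dfrac{(m_2a_1-m_1a_2)^2}{m_1m_2n(m_1+m_2)}$ and (ii) $\dfrac{[m_3(a_1+a_2)-(m_1+m_2)a_3]^2}{(m_1+m_2)m_3n}$

we have in (i) $k_{11}=m_2$, $k_{12}=-m_1$, $k_{13}=0$

and in (ii) $k_{21}=m_3$, $k_{22}=m_3$, $k_{23}=-(m_1+m_2)$

and so, for (i),

$$S(mk_1)=m_1m_2-m_2m_1=0$$

$$S(mk_1^2)=m_1m_2^2+m_2m_1^2=m_1m_2(m_1+m_2)$$

giving as the divisor $m_1m_2n(m_1+m_2)$, while for (ii)

$$S(mk_2)=m_1m_3+m_2m_3-m_3(m_1+m_2)=0$$

$$S(mk_2^2)=m_1m_3^2+m_2m_3^2+m_3(m_1+m_2)^2=m_3(m_1+m_2)(m_1+m_2+m_3)=m_3(m_1+m_2)$$

giving as the divisor $(m_1+m_2)m_3n$. Finally,

$$S(mk_1k_2)=m_1m_2m_3-m_2m_1m_3=0$$

The partition satisfies all three conditions.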
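
The three conditions lend themselves to a mechanical check. In the sketch below the helper `S` imitates the summation symbol of the text, and the expected proportions 1/2, 1/4, 1/4 are an arbitrary assumption used only to give the symbols numerical values.

```python
from fractions import Fraction as F

def S(*columns):
    """Sum over classes of the product of the given columns,
    e.g. S(m, k1, k2) stands for S(m k1 k2) in the text."""
    total = F(0)
    for values in zip(*columns):
        term = F(1)
        for v in values:
            term *= v
        total += term
    return total

# Hypothetical expected proportions for three classes
m1, m2, m3 = F(1, 2), F(1, 4), F(1, 4)
m = [m1, m2, m3]

k1 = [m2, -m1, 0]              # comparison (i): class 1 against class 2
k2 = [m3, m3, -(m1 + m2)]      # comparison (ii): classes 1 + 2 against class 3

condition_i = (S(m, k1) == 0, S(m, k2) == 0)
divisors_per_n = (S(m, k1, k1), S(m, k2, k2))   # multiply by n for the divisor
condition_iii = S(m, k1, k2) == 0

print(condition_i, divisors_per_n, condition_iii)
```

The two divisors come out as $m_1m_2(m_1+m_2)$ and $m_3(m_1+m_2)$ for the chosen proportions, as the algebra requires.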

The application of the rules of partition to more complex 
cases can be illustrated by a genetical example. 

Example 22. If a plant, which is heterozygous for two genes,
A,a and B,b, is self-pollinated, the progeny is expected to fall
into four phenotypic classes, AB, Ab, aB, ab. With no disturbance
of the segregation each gene is expected to show a 3 : 1 ratio,
i.e. (AB+Ab) should be three times as frequent as (aB+ab) and
(AB+aB) three times as frequent as (Ab+ab). Furthermore,
with unlinked genes their segregations are independent, i.e. the
ratio of B : b is expected to be the same in A plants as it is in
a plants. In short, with good gene segregations and no linkage
the relative frequencies of the four classes are expected to be
9 AB : 3 Ab : 3 aB : 1 ab. Thus $m_1=\tfrac{9}{16}$, $m_2=m_3=\tfrac{3}{16}$ and $m_4=\tfrac{1}{16}$.

Let $a_1$, $a_2$, $a_3$ and $a_4$ be the frequencies observed, with $S(a)=n$.

Using the general formula, a $\chi^2$ for 3 degrees of freedom can
be calculated to test the goodness of fit of observation with
expectation in the four classes jointly. We require to analyse
this into its three simple components in such a way as to tell us
something about the nature of the two segregations and about
the possible linkage of the two genes.

For the purpose of testing the segregation of A and a, the
classes AB and Ab are pooled to give one A class, and aB and ab
are pooled to give a single a class. The former is expected to
be three times as large as the latter, i.e. the expected proportions
are 3/4 and 1/4. Then a $\chi^2$ testing the segregation of A and a
will depend on the comparison

$$(a_1+a_2)-3(a_3+a_4)$$

The coefficients used in the calculation of this simple $\chi^2$ will
be 1 for both A classes and -3 for the two a classes. In terms
of the four classes observed the comparison must then be

$$a_1+a_2-3a_3-3a_4$$

The first condition of partition is fulfilled as
$m_1=\tfrac{9}{16}$, $m_2=m_3=\tfrac{3}{16}$ and $m_4=\tfrac{1}{16}$,
and $k_{11}=k_{12}=1$, $k_{13}=k_{14}=-3$,
so that $S(mk_1)=\tfrac{1}{16}(9+3-9-3)=0$.

The divisor for this comparison is found from the formula
$nS(mk_1^2)$ and is

$$\frac{n}{16}(9+3+27+9)=3n$$

So $\chi^2_{[1]}$ testing the A : a segregation is

$$\frac{(a_1+a_2-3a_3-3a_4)^2}{3n}$$

It will be seen that this is simply a modification of the two-class
segregation formula of Section 45, the two classes each being
subdivided into two sub-classes which are then treated alike.
Similarly the $\chi^2$ testing the B : b segregation will be

$$\frac{(a_1-3a_2+a_3-3a_4)^2}{3n}$$

These two components are independent as

$$S(mk_1k_2)=\tfrac{1}{16}[(9\times1\times1)+(3\times1\times[-3])+(3\times[-3]\times1)+(1\times[-3]\times[-3])]=\tfrac{1}{16}[9-9-9+9]=0$$

This leaves only the question of the linkage $\chi^2$ to be settled.

What values should be assigned to $k_{31}$, &c.? Only one set
of values can possibly be taken by these coefficients, as the
deduction of the two simple $\chi^2$'s already found from the compound
$\chi^2$ for 3 degrees of freedom can leave but one possible formula
for the remaining component. It is, however, not necessary to
arrive at it by the method of subtraction. Fisher's way of
finding the coefficients in the comparisons on which components
of sums of squares are based (Section 25) applies here also.

The two genes are each expected to show a segregation of
3 : 1. Then $\chi^2$ testing the segregation of A : a, i.e. the 'main
effect' of A, can be represented by

$$(A-3a)(B+b)=AB+Ab-3aB-3ab$$

from which the k values are easily derived.

Similarly, the segregation of B : b is represented as

$$(A+a)(B-3b)=AB-3Ab+aB-3ab$$

Lastly, the linkage component or 'interaction' of A and B is

$$(A-3a)(B-3b)=AB-3Ab-3aB+9ab$$

Thus $k_{31}=1$, $k_{32}=-3$, $k_{33}=-3$ and $k_{34}=9$. The divisor for this
comparison is

$$nS(mk_3^2)=\frac{n}{16}[(9\times1^2)+(3\times[-3]^2)+(3\times[-3]^2)+(1\times9^2)]=\frac{n}{16}(9+27+27+81)=9n$$

and

$$\chi^2_{[1]}=\frac{(a_1-3a_2-3a_3+9a_4)^2}{9n}$$

That this is independent of the two previous comparisons is
verified by

$$S(mk_1k_3)=\tfrac{1}{16}[(9\times1\times1)+(3\times1\times[-3])+(3\times[-3]\times[-3])+(1\times[-3]\times9)]=\tfrac{1}{16}[9-9+27-27]=0$$

and

$$S(mk_2k_3)=\tfrac{1}{16}[(9\times1\times1)+(3\times[-3]\times[-3])+(3\times1\times[-3])+(1\times[-3]\times9)]=\tfrac{1}{16}[9+27-9-27]=0$$

The partition is now complete, the formulae for the three
simple components of the $\chi^2_{[3]}$ being

A : a      $\chi^2_{[1]}=\dfrac{(a_1+a_2-3a_3-3a_4)^2}{3n}$

B : b      $\chi^2_{[1]}=\dfrac{(a_1-3a_2+a_3-3a_4)^2}{3n}$

Linkage    $\chi^2_{[1]}=\dfrac{(a_1-3a_2-3a_3+9a_4)^2}{9n}$
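
Fisher's device of multiplying out symbolic factors is easily imitated: each factor is held as a pair of coefficients (dominant, recessive), and the product over the two genes yields the k values in the class order AB, Ab, aB, ab. The names used here are illustrative assumptions only.

```python
from fractions import Fraction as F
from itertools import product

def expand(gene_a, gene_b):
    """Multiply out (xA + ya)(uB + vb); the coefficients are returned
    for the classes in the order AB, Ab, aB, ab."""
    return [x * y for x, y in product(gene_a, gene_b)]

m = [F(9, 16), F(3, 16), F(3, 16), F(1, 16)]   # 9 : 3 : 3 : 1 expectation

k = {
    "A:a": expand((1, -3), (1, 1)),        # (A - 3a)(B + b)
    "B:b": expand((1, 1), (1, -3)),        # (A + a)(B - 3b)
    "linkage": expand((1, -3), (1, -3)),   # (A - 3a)(B - 3b)
}

# S(m k^2) for each comparison; the divisor is n times this quantity
divisor_per_n = {name: sum(mi * ki * ki for mi, ki in zip(m, ks))
                 for name, ks in k.items()}

print(k)
print(divisor_per_n)
```

The divisors 3n, 3n and 9n of the three formulae are recovered, and the same function serves for any other pair of expected ratios by changing the coefficient pairs.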

Details of two families of the type under discussion are given
in Table 42. They were recorded by Philp and Imai respectively,
the plants being Papaver and Pharbitis. The gene symbols used
by these authors were not A,a and B,b, but the latter are more
convenient for our purpose.



TABLE 42

Philp's and Imai's Data on Two-gene Segregations

                               Numbers of plants in classes
                            AB          Ab          aB          ab       Total
Philp   Observed (a)        72          29          40          12        153
        Expected (mn)       86.0625     28.6875     28.6875      9.5625   153
Imai    Observed (a)       123          30          27          21        201
        Expected (mn)      113.0625     37.6875     37.6875     12.5625   201


The expectations given in the table are those based on the
9 : 3 : 3 : 1 ratio. A compound $\chi^2$ for 3 degrees of freedom can
be found from each set of data. Thus in Philp's case it is

$$\chi^2_{[3]}=S\left(\frac{a^2}{mn}\right)-n=\frac{72^2}{86.0625}+\frac{29^2}{28.6875}+\frac{40^2}{28.6875}+\frac{12^2}{9.5625}-153=7.3834,$$

which has a probability of between 0.10 and 0.05.
For Imai's data

$$\chi^2_{[3]}=\frac{123^2}{113.0625}+\frac{30^2}{37.6875}+\frac{27^2}{37.6875}+\frac{21^2}{12.5625}-201=11.1393$$

with a probability of between 0.02 and 0.01.

The departure from expectation is clear in Imai's case and,
though not quite significant in Philp's data, is at least suspiciously
large and worthy of further investigation. The next step is that
of partitioning $\chi^2_{[3]}$, using the formulae developed above. In
Philp's case

A : a      $\chi^2_{[1]}=\dfrac{[72+29-(3\times40)-(3\times12)]^2}{3\times153}=6.5904$ with probability 0.02-0.01

B : b      $\chi^2_{[1]}=\dfrac{[72-(3\times29)+40-(3\times12)]^2}{3\times153}=0.2636$ with probability 0.70-0.50

Linkage    $\chi^2_{[1]}=\dfrac{[72-(3\times29)-(3\times40)+(9\times12)]^2}{9\times153}=0.5294$ with probability 0.50-0.30

The three items sum to give $\chi^2_{[3]}=7.3834$, so providing a check
on the arithmetic.










It is now abundantly clear that the departure from expectation
lies solely in the lack of agreement of the A : a segregation with
its expected 3 : 1. The B : b segregation is good and there is no
evidence of linkage. It may also be noted that when $\chi^2_{[3]}$ is
partitioned the significance of the departure of the A : a segregation
is increased, because the single large $\chi^2$ is isolated from the two
smaller ones which previously masked its full effect.

The partition in Imai's data is

A : a      $\chi^2_{[1]}=\dfrac{[123+30-(3\times27)-(3\times21)]^2}{3\times201}=0.1343$ with probability 0.80-0.70

B : b      $\chi^2_{[1]}=\dfrac{[123-(3\times30)+27-(3\times21)]^2}{3\times201}=0.0149$ with probability 0.95-0.90

Linkage    $\chi^2_{[1]}=\dfrac{[123-(3\times30)-(3\times27)+(9\times21)]^2}{9\times201}=10.9900$ with probability less than 0.001

Again the calculation is checked by summing the three items to
give 11.1392. The difference of 1 in the last decimal place
between this value and that obtained earlier is due, of course,
to arithmetic approximation.

Here the situation differs from that in Philp’s poppies. 
Both genes show good 3 : 1 segregations, but they are not 
independent in segregation, i.e. they must be linked. 

It is now possible to see that the two families depart from
expectation in quite different ways. Partitioning $\chi^2_{[3]}$ fully
revealed the genetical situation in each case.
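
The two partitions of Example 22 may be reproduced with the three formulae derived earlier. The function name is an assumption for illustration; the counts are those of Table 42.

```python
def partition_9331(a1, a2, a3, a4):
    """Three single-degree-of-freedom chi-squares for a 9 : 3 : 3 : 1
    expectation, classes taken in the order AB, Ab, aB, ab."""
    n = a1 + a2 + a3 + a4
    return {
        "A:a": (a1 + a2 - 3 * a3 - 3 * a4) ** 2 / (3 * n),
        "B:b": (a1 - 3 * a2 + a3 - 3 * a4) ** 2 / (3 * n),
        "linkage": (a1 - 3 * a2 - 3 * a3 + 9 * a4) ** 2 / (9 * n),
    }

philp = partition_9331(72, 29, 40, 12)
imai = partition_9331(123, 30, 27, 21)

for name, parts in (("Philp", philp), ("Imai", imai)):
    rounded = {item: round(value, 4) for item, value in parts.items()}
    print(name, rounded, "total", round(sum(parts.values()), 4))
```

In Philp's family nearly all of the compound $\chi^2$ lies in the A : a item, whereas in Imai's it lies in the linkage item, which is the contrast drawn in the text.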

More complex $\chi^2$'s can be partitioned by exactly the same
methods. Suppose there are three genes, A,a, B,b, C,c, segregating
in a backcross (AaBbCc × aabbcc). Each gene is expected to
show a 1 : 1 ratio and, in the absence of linkage, the three
segregations are independent. Hence eight classes, ABC, ABc,
AbC, aBC, Abc, aBc, abC and abc are expected in equal numbers.
A compound $\chi^2$ for 7 degrees of freedom can be calculated from
the data as a whole. How should this be partitioned in order
that the genetical situation can be fully appreciated?

Fisher's method may be used to find the k coefficients in the
seven comparisons on which the partition is based. These must
clearly consist of one for each of the three gene segregations or
'main effects', one each for the three linkage pairs or 'first-order
interactions' and one for the 'second-order interaction'
which here has no simple genetical interpretation. The k's are
found as shown in Table 43.



TABLE 43

Comparison                                        ABC  ABc  AbC  aBC  Abc  aBc  abC  abc
Segregation of A      (A-a)(B+b)(C+c)              1    1    1   -1    1   -1   -1   -1
Segregation of B      (A+a)(B-b)(C+c)              1    1   -1    1   -1    1   -1   -1
Segregation of C      (A+a)(B+b)(C-c)              1   -1    1    1   -1   -1    1   -1
Linkage of A and B    (A-a)(B-b)(C+c)              1    1   -1   -1   -1   -1    1    1
Linkage of A and C    (A-a)(B+b)(C-c)              1   -1    1   -1   -1    1   -1    1
Linkage of B and C    (A+a)(B-b)(C-c)              1   -1   -1    1    1   -1   -1    1
Second-order
interaction           (A-a)(B-b)(C-c)              1   -1   -1   -1    1    1    1   -1

It should be noted that (A-a) is used, and not (A-3a) as in
the previous example, since A and a are expected now to be
equally frequent and not to show a 3 : 1 ratio.

In every case $S(mk^2)=1$ and so the divisor appropriate to
each comparison is n. The calculation of the 21 different
$S(mk_ik_j)$ quantities shows that all the comparisons are
independent.
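
The seven sets of coefficients, and the 21 tests of independence, can be generated rather than written out by hand. In this sketch the classes are produced in the order ABC, ABc, AbC, Abc, aBC, aBc, abC, abc, which differs from the order of the text only in arrangement; the names are illustrative.

```python
from fractions import Fraction as F
from itertools import combinations, product

# The eight backcross classes as tuples such as ('A', 'b', 'C')
classes = list(product("Aa", "Bb", "Cc"))
m = F(1, 8)   # every class is expected with the same frequency

def coefficient(cls, genes):
    """Expand the product in which each gene listed in `genes`
    contributes (X - x) and every other gene contributes (X + x)."""
    k = 1
    for g in genes:
        if cls[g].islower():
            k = -k
    return k

# Three main effects, three first-order and one second-order interaction
gene_sets = [gs for size in (1, 2, 3) for gs in combinations(range(3), size)]
table = {gs: [coefficient(c, gs) for c in classes] for gs in gene_sets}

sums_mk = [sum(m * k for k in ks) for ks in table.values()]
sums_mk2 = [sum(m * k * k for k in ks) for ks in table.values()]
cross_products = [sum(m * x * y for x, y in zip(p, q))
                  for p, q in combinations(table.values(), 2)]

print(len(table), set(sums_mk), set(sums_mk2))
print(len(cross_products), set(cross_products))
```

All seven comparisons satisfy the first condition, each has $S(mk^2)=1$, and every one of the 21 cross-products vanishes.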


47. THE EFFECT OF FITTING A PARAMETER 


Where a given number of individuals are assignable to
$j$ classes there are usually $j-1$ degrees of freedom for the calculation
of a $\chi^2$ testing the general agreement of observation and expectation
over the whole of the data. Thus, in the case of the four-class
segregation analysed in Example 22 a compound $\chi^2$ was
calculated for 3 degrees of freedom. This can be partitioned
into three components each depending on a single comparison
between the observed frequencies. It is then clear that adjustment
of one or more of these comparisons will reduce the value
of the total $\chi^2$ in a characteristic way. This can perhaps be best
understood from a further analysis of Philp’s poppy data of
Table 42.

Example 23. We have seen that the departure of Philp’s
observed frequencies from the 9 : 3 : 3 : 1 expectation is due to
the gene A,a failing to give a 3 : 1 ratio. New expectations can
be formulated using the observed A,a ratio in place of the original
3 : 1. There were 101 A plants and 52 a plants, so that the
ratio is 1.9423 : 1. Since the segregation of gene B,b is still
expected to be 3 : 1 and the two genes segregate independently
of one another the new expectation is

5.8269 : 1.9423 : 3 : 1

which, with a total of 153 plants, gives as the expected frequencies

75.75 AB, 25.25 Ab, 39.00 aB, 13.00 ab

Using the formula

$$\chi^2 = S\left(\frac{a^2}{mn}\right) - n$$

the general goodness of fit is found to be tested by $\chi^2_{[2]} = 0.8450$.
This is lower than the $\chi^2$ obtained when the 9 : 3 : 3 : 1 expectation
was used, but it also has 1 degree of freedom less, as will be
readily appreciated when a partition is made.

The coefficients are found by Fisher’s method:

A : a      (A - 1.9423a)(B + b) = AB + Ab - 1.9423aB - 1.9423ab
B : b      (A + a)(B - 3b) = AB - 3Ab + aB - 3ab
Linkage    (A - 1.9423a)(B - 3b) = AB - 3Ab - 1.9423aB + 5.8269ab

It should be noticed that (A - 1.9423a) replaces (A - 3a) now
that it has been agreed to discard the 3 : 1 expectation in favour
of the ratio observed.



The formula $nS(mk^2)$ shows that the divisors of the three
comparisons are $1.9423n$, $3n$ and $5.8269n$ respectively. The
three portions of $\chi^2$ are

A : a
$$\chi^2_{[1]} = \frac{(a_1 + a_2 - 1.9423a_3 - 1.9423a_4)^2}{1.9423n}$$

B : b
$$\chi^2_{[1]} = \frac{(a_1 - 3a_2 + a_3 - 3a_4)^2}{3n}$$

Linkage
$$\chi^2_{[1]} = \frac{(a_1 - 3a_2 - 1.9423a_3 + 5.8269a_4)^2}{5.8269n}$$


The first item, testing the A : a segregation, must clearly be 0,
since the $k$’s of the comparison are directly derived from the
segregation observed. In other words, this component of $\chi^2$ has
been fixed and eliminated from the analysis by the adoption of
the observed A : a segregation. Calculation of the parameter
measuring the A : a segregation has removed one component of
$\chi^2$ and so reduced the total number of possible comparisons,
i.e. of degrees of freedom, by 1.

On substituting the frequencies observed in the above formulae
the other two components of $\chi^2$ are found to be 0.2636 and
0.5815 respectively. These sum to 0.8451 as found for the
compound $\chi^2$, which is thus seen to be separable into only two
simple components. The B : b component is exactly the same
as in the previous analysis, but the linkage item has changed
from 0.5294 to 0.5815. When the formulae for this component
in the two analyses are compared, it will be observed that the
change in the A : a expectation has altered the $k$’s in the linkage
comparison, so changing its arithmetic value a little. In general
terms we might say that the previous hypothesis had been
rendered unlikely and replaced by a new one. It is not to be
expected that the goodness of fit will remain unchanged under
these circumstances.
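The arithmetic of Example 23 can be retraced with a short Python sketch. It uses the four observed frequencies quoted in the text (72 AB, 29 Ab, 40 aB, 12 ab); the helper `component` and the variable names are illustrative assumptions rather than anything given in the original.

```python
# Philp's poppy data in the order AB, Ab, aB, ab.
a1, a2, a3, a4 = 72, 29, 40, 12
observed = (a1, a2, a3, a4)
n = sum(observed)

# Observed A:a ratio l : 1, adopted in place of the expected 3 : 1.
l = (a1 + a2) / (a3 + a4)


def component(ks, divisor):
    """One simple chi-square: (sum of k*a) squared over n*S(m k^2)."""
    total = sum(k * a for k, a in zip(ks, observed))
    return total ** 2 / divisor


chi2_A = component((1, 1, -l, -l), l * n)             # fixed at 0 by the fitting
chi2_B = component((1, -3, 1, -3), 3 * n)             # B:b segregation
chi2_link = component((1, -3, -l, 3 * l), 3 * l * n)  # linkage

# Compound chi-square from the new expectations, S(a^2 / mn) - n.
expected = (0.75 * (a1 + a2), 0.25 * (a1 + a2),
            0.75 * (a3 + a4), 0.25 * (a3 + a4))
compound = sum(a * a / e for a, e in zip(observed, expected)) - n

print(round(chi2_A, 4), round(chi2_B, 4), round(chi2_link, 4), round(compound, 4))
```

The A : a component vanishes (apart from floating-point rounding), and the compound value equals the sum of the two remaining components, in line with the loss of one degree of freedom.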

The principle that a degree of freedom is lost when a parameter
is fitted is of general application in statistical analysis. It has
earlier been seen in operation in fitting the normal distribution
(Section 10), where the variance is estimated from $n-1$ degrees
of freedom after the mean is calculated. If a regression coefficient
is estimated another degree of freedom is lost from the material
for the calculation of the residual sum of squares, and so on.
The $\chi^2$ goodness of fit falls into line with these others.

At the risk of anticipating the next chapter it may be noted
here that the method of estimating the parameter must be a
good one or the component which is supposedly eliminated will not
actually be reduced to 0. In this way a spuriously large departure
from expectation might be found as a result of faulty estimation.
The question of what constitutes a good method of estimation
must be reserved for later comment.


48. HETEROGENEITY OF DATA 

Frequency observations are often made in replicate, as, for
example, in the case of Sirks’s data on the segregation of the
gene P,p in Datura stramonium (Example 5). In such cases it
is of obvious importance to determine whether all the series of
observations are in agreement with each other as well as with
the expectation. A calculation designed to answer this question
is generally known as a test of heterogeneity of the data.

The simplest case of this kind is that of two sets of observations
in each of which the individuals may fall into either of two classes.
Sirks’s data referred to above are of this kind.

Example 24. As described in Example 6 two families were 
raised each of which was expected to show a 3 : 1 segregation 
for P : p plants. The details of the two families are given in 
Table 44. 


TABLE 44

Segregation in Datura stramonium (Sirks)

               Number of plants
Family         P        p      Total      $\chi^2$
1922          59       14        73      1.3196
1925         103       23       126      3.0582
Total        162       37       199      4.3568


In effect, the data consist of a four-class table, the classes
being distinguished by the simultaneous classification for type
and family. The hypothesis to be tested is that the two families
agree with each other in showing a 3 : 1 segregation.

Three degrees of freedom, or comparisons, are available from
such a four-class table, but they are not of equal importance
with respect to this hypothesis. The analysis of Section 46
shows that the three simple $\chi^2$’s can be made to test

(i) The agreement of the P : p totals with hypothesis.

(ii) The agreement of the family totals with hypothesis.

(iii) The independence of the two classifications.

The first of these is clearly found from the P : p totals as

$$\chi^2_{[1]} = \frac{[162 - (3 \times 37)]^2}{3 \times 199} = 4.3568,$$

and tests the general agreement of the data with the 3 : 1 expected.


The second component cannot, however, be calculated,
because the hypothesis makes no stipulation about the size of
the families. The third $\chi^2_{[1]}$ can be found in exactly the same
way as the linkage component in the analysis of Philp’s poppy
data after the expectation of 3 : 1 for the A : a segregation had
been abandoned. This item is the one which will answer the
question of the agreement of the two families with each other.
It is an interaction term and, as such, tests whether classification
into P and p is affected by the family classification.

The ratio of the family size is 126/73 : 1 or 1.7260 : 1. Hence
the third or heterogeneity component of $\chi^2$ is

$$\chi^2_{[1]} = \frac{[103 - (3 \times 23) - (1.7260 \times 59) + (3 \times 1.7260 \times 14)]^2}{3 \times 1.7260 \times 199} = 0.0210$$

with a probability of 0.90-0.80.

This heterogeneity analysis can, however, be carried out by
a more general and more convenient method. We have seen in
the last example how a compound $\chi^2$ for 2 degrees of freedom
can be calculated from a two-way classification when the observed
ratio for one of the classifications is itself used in place of any
expectation. A little consideration of this operation as carried
out on Philp’s data will show that the calculation of this $\chi^2$ for
2 degrees of freedom really resolves itself into finding two simple
$\chi^2$’s. Each of these tests the agreement of the data with the
other expected ratio within one of the two categories separated
by the classification for whose result no expectation exists.
Thus with Philp’s data we found, on using the general formula,
that the compound $\chi^2$ for 2 degrees of freedom was 0.8450. This
could equally well have been obtained by summing the two
simple $\chi^2$’s testing the agreement of the B : b segregation with
the expected 3 : 1 in the A plants and in the a plants separately.
These two items are

$$\frac{[72 - (3 \times 29)]^2}{3 \times 101} \quad \text{and} \quad \frac{[40 - (3 \times 12)]^2}{3 \times 52}$$

which give a sum of 0.8450.

Hence the heterogeneity analysis of Sirks’s data can be carried
out by calculating three $\chi^2$’s, each having 1 degree of freedom,
from the two single families and from their total, respectively.
These three items, shown in the appropriate lines of Table 44,
are found as

$$\frac{[59 - (3 \times 14)]^2}{3 \times 73}, \quad \frac{[103 - (3 \times 23)]^2}{3 \times 126}, \quad \frac{[162 - (3 \times 37)]^2}{3 \times 199}$$

giving 1.3196, 3.0582 and 4.3568. The last one is the test of
agreement of the total data with expectation. The sum of the
first two, viz. 4.3778, is the compound $\chi^2$ for 2 degrees of freedom
which includes the item testing total agreement, as already
found, together with the heterogeneity item. The latter can
then be obtained by subtraction:

$$\chi^2_{[1]} = 4.3778 - 4.3568 = 0.0210$$

The analysis is set out in tabular form in Table 45.

TABLE 45

The Analysis of $\chi^2$ for Sirks’s Data

Item              $\chi^2$      N      P
Deviation        4.3568      1      0.05-0.02
Heterogeneity    0.0210      1      0.90-0.80
Total            4.3778      2

The deviation $\chi^2$ tests the agreement of the totals with the
expected segregation. The total $\chi^2$ is, of course, the sum of the
two items from the single families.

This analysis is of some interest in consisting essentially of
two steps. The first operation is that of building up a compound
$\chi^2$ from its simple items, the second step is that of breaking
down this same compound into simple components on a new basis.
It is a remarkable example of that power of synthesizing and
re-analysing a compound along new lines which is conferred by
the system of independent comparisons developed in Section 46.
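The two steps of building up and breaking down can be sketched in Python for Sirks’s two families. The function name `chi2_3_to_1` and the dictionary layout are illustrative assumptions; the frequencies are those used in the formulae of Example 24.

```python
def chi2_3_to_1(dominant, recessive):
    """Simple chi-square (1 d.f.) for agreement with a 3 : 1 ratio."""
    n = dominant + recessive
    return (dominant - 3 * recessive) ** 2 / (3 * n)


# (P plants, p plants) for each family.
families = {"1922": (59, 14), "1925": (103, 23)}

single = {name: chi2_3_to_1(*counts) for name, counts in families.items()}
totals = tuple(sum(column) for column in zip(*families.values()))

compound = sum(single.values())           # 2 d.f.: built up from the families
deviation = chi2_3_to_1(*totals)          # 1 d.f.: totals against 3 : 1
heterogeneity = compound - deviation      # 1 d.f.: obtained by subtraction

# Direct calculation of the same item as an interaction, using the observed
# ratio of family sizes l : 1 in place of any expectation.
l = sum(families["1925"]) / sum(families["1922"])
p25, q25 = families["1925"]
p22, q22 = families["1922"]
direct = (p25 - 3 * q25 - l * p22 + 3 * l * q22) ** 2 / (3 * l * sum(totals))

print(round(deviation, 4), round(heterogeneity, 4), round(direct, 4))
```

The subtraction and the direct interaction formula give the same heterogeneity item, which is why the more convenient subtraction route can be used.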

When the heterogeneity analysis is carried out in this way
it is easily seen that the method can be extended to data including
any number of families. As an example we may take certain
families of Antirrhinum majus which were segregating for yellow
and ivory flower colour (Mather’s data).

Example 25. Four plants known to be heterozygous for the
gene in question were self-pollinated and the individual progenies
so obtained gave the following segregations:


TABLE 46

Segregation in Antirrhinum majus (Mather)

               Numbers of plants
Family      Yellow    Ivory    Total      $\chi^2$
1              …        …        …       5.0690
2              …        …        …       6.4444
3              …        …        …       0.4000
4              …        …        …       0.1111
Total          …        …        …       0.7353

(The class frequencies of this table are illegible in the source.)


A simple $\chi^2$ is calculated from each family separately, testing
its agreement with the expected 3 : 1 ratio. Thus family 1 gives

$$\chi^2_{[1]} = \frac{[27 - (3 \times 2)]^2}{3 \times 29} = 5.0690.$$

The sum of the four $\chi^2$’s obtained in
this way from the separate families is itself a compound $\chi^2$ of
12.0245 for 4 degrees of freedom. It consists of two parts, that
testing the general agreement of the totals with the 3 : 1 expectation
and the heterogeneity item. The former is found from the
total segregation and has, of course, 1 degree of freedom. The
latter is found, as in the last example, by subtraction and must
clearly have 3 degrees of freedom. The analysis is set out in
Table 47.


TABLE 47

Analysis of $\chi^2$ for Mather’s Data

Item              $\chi^2$       N      P             Mean square
Deviation         0.7353      1      0.50-0.30
Heterogeneity    11.2892      3      0.02-0.01     3.7631
Total            12.0245      4


The result of this analysis is itself worth considering in some
detail. The totals agree with the 3 : 1 expected but there is
significant heterogeneity, i.e. the families disagree with one
another. This last result must mean that the agreement of the
totals with expectation is fortuitous, as it shows that all the
component families do not themselves agree with this expectation.
Heterogeneity disposes of the hypothesis as effectively as a
significant deviation item.

We may note also that where heterogeneity is present the
totals are subject to a variance greater than that expected on
hypothesis, because the discrepancies between the families add a
new item to the variation. The correct test of significance of
the deviation item is thus not given by the use of $\chi^2$ but by the
comparison of the deviation $\chi^2$ with the mean square obtained
from the heterogeneity item, to give a variance ratio. In the
present example the mean square is 11.2892/3 = 3.7631, which, being
larger than the deviation item, shows the latter not to be
significant.

When the variance is demonstrably greater than that expected
from hypothesis the $\chi^2$ must be abandoned, as its use presupposes
that an observed variance is being compared with
hypothesis. In its place we must resort to the $z$ test, or to its
transformation, the variance ratio, by which two observed
variances are compared.
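The bookkeeping of Example 25 can be sketched in Python starting from the $\chi^2$ values quoted in the text (the class frequencies themselves are not used). The variable names are illustrative assumptions.

```python
# Simple chi-squares (1 d.f. each) from the four families, as quoted in the text.
family_chi2 = [5.0690, 6.4444, 0.4000, 0.1111]
# Chi-square (1 d.f.) from the total segregation, as quoted in the text.
deviation_chi2 = 0.7353

compound = sum(family_chi2)                   # one d.f. per family
heterogeneity = compound - deviation_chi2     # by subtraction
heterogeneity_df = len(family_chi2) - 1

# With significant heterogeneity the deviation item is judged against the
# heterogeneity mean square, giving a variance ratio.
mean_square = heterogeneity / heterogeneity_df
variance_ratio = deviation_chi2 / mean_square

print(round(compound, 4), round(heterogeneity, 4),
      round(mean_square, 4), round(variance_ratio, 4))
```

A variance ratio below 1, as here, cannot indicate a significant deviation, whatever the degrees of freedom.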




The method of heterogeneity analysis can be adapted to 
more complex cases, and used in conjunction with partition by 
orthogonal functions, but the reader must be referred to Mather’s 
account of such analyses for further details. 

49. THE 2x2 CONTINGENCY TABLE 

We have seen how one of the main comparisons in a four-class
table can be removed from the analysis of $\chi^2$ without
affecting the test of agreement of either the remaining main
comparison or the test of independence of the two comparisons.
It is possible to take a further step and to accept the results of
both classifications as their own expectations, so to speak, and
yet still to be able to test the independence of the two classifications.
In such analysis the two $\chi^2$’s testing the main effects or
classifications are being removed from consideration, attention
being confined to the third simple $\chi^2$ testing the interaction or
independence of the two methods of classification.

Example 26. Lawrence and Newell, while experimenting
with the composition of soil composts, recorded the following
hitherto unpublished results of a germination trial with Primula
sinensis seeds. The seeds were divided into two groups and set
to germinate in dishes containing filter papers soaked respectively
in rain water and water which had been allowed to seep through
loam before use. The germinations were as follows:

TABLE 48

Germination in Primula (Lawrence and Newell)

                 Germinated    Ungerminated    Total
Loam water           37             13           50
Rain water           32             18           50
Total                69             31          100


Does the type of water affect germination ? 

The two marginal classifications are clearly uninteresting and 
indeed have no expectations. The number of seeds in each test 
can be varied at will and the percentage germination which is 
observed depends on many other factors such as the genetical 
constitution of the seeds, age, aeration, and so on. The question 
at issue really reduces to that of testing the independence of the
two classifications. To obtain an answer it is necessary only to
calculate the interaction $\chi^2$ item.

The calculation of such a contingency $\chi^2$ can be undertaken
by an extension of the methods of Example 24. Reverting to
Philp’s poppies, it was seen that, where a 3 : 1 was expected
for each of the two main classifications, the interaction, or, in
that case, linkage, $\chi^2$ was found as

$$\chi^2_{[1]} = \frac{[a_1 - 3a_2 - 3a_3 + (3 \times 3)a_4]^2}{3 \times 3 \times n}$$

When it was decided to substitute the observed ratio, which may
be denoted as $l_A : 1$, for the A : a segregation, the interaction
item became

$$\chi^2_{[1]} = \frac{[a_1 - 3a_2 - l_A a_3 + (3 \times l_A)a_4]^2}{3 \times l_A \times n}$$

If the next step is taken and the observed segregation, $l_B : 1$,
for B : b is used, the interaction $\chi^2$ is shown by Fisher’s
method to be

$$\chi^2_{[1]} = \frac{[a_1 - l_B a_2 - l_A a_3 + l_A l_B a_4]^2}{l_A l_B n}$$

This is the formula necessary for the analysis of Lawrence and
Newell’s data. The ratio of the number of seeds in the two
tests, $l_T : 1$, is 1 : 1. The ratio of the two germination classes,
denoted as $l_G : 1$, is 69/31 : 1, i.e. 2.2258 : 1. Then the contingency
$\chi^2$ is

$$\chi^2_{[1]} = \frac{[a_1 - l_T a_3 - l_G a_2 + l_T l_G a_4]^2}{l_T l_G n} = \frac{[37 - (1 \times 32) - (2.2258 \times 13) + (1 \times 2.2258 \times 18)]^2}{1 \times 2.2258 \times 100} = 1.1688$$

which for 1 degree of freedom has a probability of 0.30-0.20,
showing that there is no interaction between the classifications,
i.e. that the type of water does not affect germination.

This method of finding $\chi^2$ is not, however, the one in common
use for the 2x2 contingency table. The usual formula does not
require the calculation of the two marginal ratios. It is not
difficult to derive from the formula used above.

The contingency table may be written out, using the values
$a_1$ to $a_4$ for the four observed class frequencies. It then becomes

$$\begin{array}{cc|c}
a_1 & a_2 & a_1 + a_2 \\
a_3 & a_4 & a_3 + a_4 \\
\hline
a_1 + a_3 & a_2 + a_4 & n
\end{array}$$

The margins show ratios of $\frac{a_1+a_2}{a_3+a_4} : 1$ and $\frac{a_1+a_3}{a_2+a_4} : 1$, which may
be substituted for $l_T$ and $l_G$ in the formula already used for the
calculation of $\chi^2$:

$$\chi^2_{[1]} = \frac{\left[a_1 - \left(\frac{a_1+a_3}{a_2+a_4}\right)a_2 - \left(\frac{a_1+a_2}{a_3+a_4}\right)a_3 + \frac{(a_1+a_2)(a_1+a_3)}{(a_3+a_4)(a_2+a_4)}a_4\right]^2}{\frac{(a_1+a_2)(a_1+a_3)}{(a_3+a_4)(a_2+a_4)}\,n}$$

$$= \frac{[a_1(a_2+a_4)(a_3+a_4) - a_2(a_1+a_3)(a_3+a_4) - a_3(a_1+a_2)(a_2+a_4) + a_4(a_1+a_2)(a_1+a_3)]^2}{(a_1+a_2)(a_1+a_3)(a_2+a_4)(a_3+a_4)\,n}$$

Expanding and collecting like terms in the numerator

$$\chi^2_{[1]} = \frac{[a_1a_4(a_1+a_2+a_3+a_4) - a_2a_3(a_1+a_2+a_3+a_4)]^2}{(a_1+a_2)(a_1+a_3)(a_2+a_4)(a_3+a_4)\,n}$$

which, since $n = S(a)$, reduces to the widely used form

$$\chi^2_{[1]} = \frac{(a_1a_4 - a_2a_3)^2\,n}{(a_1+a_2)(a_1+a_3)(a_2+a_4)(a_3+a_4)}$$

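Applying this to Lawrence and Newell’s data, where $a_1 = 37$,
$a_2 = 13$, $a_3 = 32$, and $a_4 = 18$, with $n = 100$, we find

$$\chi^2_{[1]} = \frac{[(37 \times 18) - (32 \times 13)]^2 \times 100}{50 \times 50 \times 69 \times 31} = \frac{6{,}250{,}000}{5{,}347{,}500} = 1.1688$$

as before.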
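The agreement of the two routes can be confirmed numerically. The Python sketch below evaluates the contingency $\chi^2$ for Lawrence and Newell’s data both from the marginal ratios and from the usual short formula; the function names are illustrative assumptions.

```python
def chi2_from_ratios(a1, a2, a3, a4):
    """Interaction chi-square using the two observed marginal ratios."""
    n = a1 + a2 + a3 + a4
    l_t = (a1 + a2) / (a3 + a4)   # ratio of the two row totals
    l_g = (a1 + a3) / (a2 + a4)   # ratio of the two column totals
    numerator = (a1 - l_g * a2 - l_t * a3 + l_t * l_g * a4) ** 2
    return numerator / (l_t * l_g * n)


def chi2_shortcut(a1, a2, a3, a4):
    """The usual 2x2 formula, needing no marginal ratios."""
    n = a1 + a2 + a3 + a4
    margins = (a1 + a2) * (a1 + a3) * (a2 + a4) * (a3 + a4)
    return (a1 * a4 - a2 * a3) ** 2 * n / margins


# Loam water: 37 germinated, 13 not; rain water: 32 germinated, 18 not.
data = (37, 13, 32, 18)
print(round(chi2_from_ratios(*data), 4), round(chi2_shortcut(*data), 4))
```

Both functions return the same value, so the short formula can be used whenever the marginal ratios are of no interest in themselves.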

Before leaving the subject of the 2x2 contingency table it
may be noted that Yates’s correction for continuity can be
applied to prevent overestimation of the significance where the
class frequencies are expected to be small. This is done by
deducting 0.5 from each of the two classes in opposite corners
giving the larger product, and adding 0.5 to each of the other
two classes giving the smaller product. In the present example
it would be necessary to reduce 37 and 18 to 36.5 and 17.5 respectively,
and to increase 32 and 13 to 32.5 and 13.5 respectively.
The effect in this case would be negligible.
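The adjustment described above can be sketched in Python as follows. The function name `yates_corrected_chi2` is an illustrative assumption, and the sketch assumes the two diagonal products differ by enough that the half-unit adjustment does not reverse their order.

```python
def yates_corrected_chi2(a1, a2, a3, a4):
    """2x2 contingency chi-square after Yates's correction for continuity."""
    n = a1 + a2 + a3 + a4
    # Deduct 0.5 from the pair of opposite corners with the larger product
    # and add 0.5 to the other pair; the marginal totals are unchanged.
    if a1 * a4 >= a2 * a3:
        a1, a4, a2, a3 = a1 - 0.5, a4 - 0.5, a2 + 0.5, a3 + 0.5
    else:
        a1, a4, a2, a3 = a1 + 0.5, a4 + 0.5, a2 - 0.5, a3 - 0.5
    margins = (a1 + a2) * (a1 + a3) * (a2 + a4) * (a3 + a4)
    return (a1 * a4 - a2 * a3) ** 2 * n / margins


corrected = yates_corrected_chi2(37, 13, 32, 18)
print(round(corrected, 4))
```

For these data the corrected value is smaller than the uncorrected 1.1688, and the conclusion that the two classifications are independent is unchanged.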

For tables with very small class frequencies Fisher’s exact
treatment is to be recommended. This is based on the multinomial
expansion and hence is an extension of the type of
calculation discussed in Chapter II.


50. THE 2xj TABLE

The 2xj contingency table * is an extension of the type of
heterogeneity analysis illustrated in Example 25 in just the same
way that the 2x2 table is an extension of that of Example 24.
It can be analysed by treating it according to the method of the
heterogeneity test. In a contingency table there is, however, no
expectation for the class frequencies in either margin of the table,
which means, of course, that no $\chi^2$ can be calculated to test
agreement with expectation of the two totals in the bottom line
of the 2xj table. This simple $\chi^2$, termed the deviation item in a
heterogeneity test, is omitted and the ratio observed in this
margin must be used to supply the $k$ coefficients which are
applied to the class frequencies in calculating $\chi^2$ from the single
families. The sum of the simple $\chi^2$’s found in this way from the
single families will be a compound $\chi^2$ corresponding to a number
of degrees of freedom one less than the total number of families.
This will be better appreciated from an example.

* Often referred to as the 2xn table.

Example 27. G. Hartman has recorded the frequencies of
men and women having certain given thresholds for tasting 
phenylthiocarbamide. Eleven strengths of solution of this 
substance were used, being labelled 0 to 10. The subjects of the 
test were given solutions of the various strengths to taste, the 
concentration immediately below that at which they commenced 
to taste being counted as their threshold. Twelve classes of each 
sex were obtained, as some subjects could taste solution 0 and 
were thus classed as having the threshold <0. The results of the 
test are given in Table 49. Do the sexes differ in the distribution 
of the thresholds that they show ? 


TABLE 49

Hartman’s Data on Taste Thresholds

                  Frequency in
Threshold    Men ($a_1$)   Women ($a_2$)   Total ($n$)     $\chi^2$       $a_1^2/n$
10               15           42            57       10.7507       3.9474
9                35           52            87        2.1115      14.0805
8                46           38            84        1.5327      25.1905
7                31           30            61        0.1925      15.7541
6                23           19            42        0.7664      12.5952
5                13           17            30        0.2632       5.6333
4                 9            6            15        0.8635       5.4000
3                 7            5            12        0.5120       4.0833
2                10           10            20        0.0316       5.0000
1                13           19            32        0.6998       5.2813
0                25           33            58        0.5601      10.7759
<0               63           43           106        5.5391      37.4433
Total           290          314           604       23.8231     139.2384


In the absence of detailed knowledge about the determination
of the threshold there is clearly no way of formulating an
expectation for the total number of people falling into each threshold
class. Similarly, though approximately equal numbers of
men and women are expected in the general population, there is
no reason why equal numbers should be included in the experiment.
The ratio obtained will depend on the circumstances
under which the experiment is conducted. In this way the
problem reduces to that of testing the independence of the two
classifications, by sex and by taste threshold.

Now if there were an expectation of $l : 1$ for sex the heterogeneity
$\chi^2$ would be found by subtracting from the sum of the
single threshold class $\chi^2$’s the $\chi^2$ found from the total segregation,
the formula

$$\chi^2_{[1]} = \frac{(a_1 - l a_2)^2}{ln}$$

being used to find these simple $\chi^2$’s from
both single thresholds and from the total. In the absence of
such an expectation it is necessary to use the observed sex totals
to estimate $l$. In this way the $\chi^2$ from the total is reduced to
0 and the sum of the $\chi^2$’s, found from each threshold, is itself
the heterogeneity, or as we now term it, contingency, $\chi^2$. It
has 11 degrees of freedom just as the heterogeneity $\chi^2$ would
have if the calculation had been made using an expected sex
segregation.

In practice it is more convenient to use the sex totals in the
calculation rather than the ratio 290/314 : 1. Thus the $\chi^2$ item from
the class of threshold 10 is found from the formula

$$\chi^2_{[1]} = \frac{[(15 \times 314) - (42 \times 290)]^2}{314 \times 290 \times 57} = \frac{55{,}800{,}900}{5{,}190{,}420} = 10.7507$$

The $\chi^2$ item from threshold 9 is

$$\chi^2_{[1]} = \frac{[(35 \times 314) - (52 \times 290)]^2}{314 \times 290 \times 87} = 2.1115$$

and so on. It will be observed that on applying this method of
calculation to the totals

$$\chi^2_{[1]} = \frac{[(290 \times 314) - (314 \times 290)]^2}{314 \times 290 \times 604} = 0$$

The sum of these $\chi^2$ components from the 12 thresholds is
23.8231, which has, as already noted, 11 degrees of freedom.
Such a value has a probability of 0.02-0.01, and so there is good
evidence that the distribution of thresholds is dependent on the
sex. Men appear to fall more often into class <0 and less often
into class 10 than do women.
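The class-by-class calculation for Hartman’s data can be sketched in Python as below. The frequencies for threshold 2 are taken as 10 men and 10 women, the values implied by the column totals and by the entry 5.0000 in the last column of Table 49; the function name `class_chi2` is an illustrative assumption.

```python
# Frequencies for thresholds 10, 9, ..., 1, 0, <0 (Table 49).
men = [15, 35, 46, 31, 23, 13, 9, 7, 10, 13, 25, 63]
women = [42, 52, 38, 30, 19, 17, 6, 5, 10, 19, 33, 43]
total_men, total_women = sum(men), sum(women)


def class_chi2(a1, a2):
    """Simple chi-square for one threshold, using the observed sex totals."""
    n = a1 + a2
    return (a1 * total_women - a2 * total_men) ** 2 / (total_women * total_men * n)


items = [class_chi2(a1, a2) for a1, a2 in zip(men, women)]
contingency = sum(items)
degrees_of_freedom = len(men) - 1

# Applied to the totals themselves the same formula gives exactly 0.
from_totals = class_chi2(total_men, total_women)

print(round(items[0], 4), round(items[1], 4))
print(round(contingency, 4), degrees_of_freedom, from_totals)
```

Because the item from the totals is 0 by construction, the sum of the twelve items is itself the contingency $\chi^2$, with one degree of freedom fewer than the number of thresholds.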

There is another method, due to Brandt and Snedecor, of
finding the contingency $\chi^2$ from a 2xj table. It is very valuable
in some cases where the subdivision is more complex than in
the present example. Mather has described such a complex
example which brings out the full value of this method of
calculation.

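The 2xj table may be written in the general form of

$$\begin{array}{ccc}
a_{11} & a_{12} & n_1 \\
a_{21} & a_{22} & n_2 \\
a_{31} & a_{32} & n_3 \\
\cdot & \cdot & \cdot \\
\cdot & \cdot & \cdot \\
a_{j1} & a_{j2} & n_j \\
\hline
a_{T1} & a_{T2} & n_T
\end{array}$$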

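Let us consider the calculation of a heterogeneity $\chi^2$ when
the total of individuals is expected to consist of $m_1 n_T$ and
$m_2 n_T$ of the two types respectively. Each line of the table
contributes an item of the type

$$\frac{(m_2 a_{11} - m_1 a_{12})^2}{m_1 m_2 n_1}$$

to the total $\chi^2$. The deviation $\chi^2$, found in like manner from
the totals in the bottom line of the table, is

$$\frac{(m_2 a_{T1} - m_1 a_{T2})^2}{m_1 m_2 n_T}$$

Hence the heterogeneity $\chi^2$ obtained by subtraction is

$$\chi^2 = \frac{1}{m_1 m_2}\left[S\left(\frac{(m_2 a_{11} - m_1 a_{12})^2}{n_1}\right) - \frac{(m_2 a_{T1} - m_1 a_{T2})^2}{n_T}\right]$$

Now

$$\frac{a_{12}^2}{n_1} = \frac{(n_1 - a_{11})^2}{n_1} = n_1 - 2a_{11} + \frac{a_{11}^2}{n_1}$$

and, similarly,

$$\frac{a_{T2}^2}{n_T} = n_T - 2a_{T1} + \frac{a_{T1}^2}{n_T}$$

Hence

$$S\left(\frac{a_{12}^2}{n_1}\right) - \frac{a_{T2}^2}{n_T} = S(n_1) - 2S(a_{11}) + S\left(\frac{a_{11}^2}{n_1}\right) - n_T + 2a_{T1} - \frac{a_{T1}^2}{n_T}$$

which, since $n_T = S(n_1)$ and $a_{T1} = S(a_{11})$, gives

$$S\left(\frac{a_{12}^2}{n_1}\right) - \frac{a_{T2}^2}{n_T} = S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}$$

It can similarly be shown that

$$S\left(\frac{a_{11}a_{12}}{n_1}\right) - \frac{a_{T1}a_{T2}}{n_T} = -\left[S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}\right]$$

The formula for the heterogeneity $\chi^2$ then becomes

$$\chi^2 = \frac{1}{m_1 m_2}\left\{m_2^2\left[S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}\right] - 2m_1m_2\left[S\left(\frac{a_{11}a_{12}}{n_1}\right) - \frac{a_{T1}a_{T2}}{n_T}\right] + m_1^2\left[S\left(\frac{a_{12}^2}{n_1}\right) - \frac{a_{T2}^2}{n_T}\right]\right\}$$

$$= \frac{1}{m_1 m_2}\left[S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}\right](m_1^2 + 2m_1m_2 + m_2^2)$$

$$= \frac{(m_1 + m_2)^2}{m_1 m_2}\left[S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}\right]$$

It is also obvious that since

$$S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T} = S\left(\frac{a_{12}^2}{n_1}\right) - \frac{a_{T2}^2}{n_T}$$

$$\chi^2 = \frac{(m_1 + m_2)^2}{m_1 m_2}\left[S\left(\frac{a_{12}^2}{n_1}\right) - \frac{a_{T2}^2}{n_T}\right]$$

The fraction $\frac{(m_1+m_2)^2}{m_1 m_2}$ equals $\frac{1}{m_1 m_2}$, since $m_1 + m_2$ is chosen equal to 1,
and may also be written $\frac{n_T^2}{m_1 n_T \cdot m_2 n_T}$.

It will be seen that this formula for the heterogeneity $\chi^2$ is
divisible into two parts. One of these, $S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}$, is independent
of the $m_1 : m_2$ expectation, and so gives a quantity
proportional to $\chi^2$ for any expectation. It may be analysed in
the same way as $\chi^2$ itself, but without prejudice to the nature
of the expectation. The results of such an analysis are then
converted into $\chi^2$’s testing agreement with any expectation
$m_1 : m_2$ by means of the appropriate multiplier, which forms the
second part of the formula. If the $\chi^2$ in question is for a contingency
test, expectations $m_1$ and $m_2$ are not available and the
observed totals $a_{T1}$ and $a_{T2}$ are used to replace $m_1 n_T$ and $m_2 n_T$,
so giving

$$\chi^2 = \frac{n_T^2}{a_{T1}a_{T2}}\left[S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T}\right]$$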


This formula can be applied to Hartman’s data. The values
of $a_{11}^2/n_1$, &c., are to be found in the right-most column of Table 49.
The value in the bottom, or total, line is $a_{T1}^2/n_T$. Then

$$S\left(\frac{a_{11}^2}{n_1}\right) - \frac{a_{T1}^2}{n_T} = (3.9474 + 14.0805 + \ldots + 37.4433) - 139.2384 = 5.9464$$

The multiplier

$$\frac{n_T^2}{a_{T1}a_{T2}} = \frac{604^2}{290 \times 314} = \frac{364{,}816}{91{,}060} = 4.0063$$

and

$$\chi^2_{[11]} = 5.9464 \times 4.0063 = 23.8231$$

The great advantage of this method arises from the fact that
the quantity proportional to $\chi^2$ can be analysed and later, by
the use of an easily determined multiplier, converted into
a $\chi^2$ depending on any given expectation. Thus, if we decided
to expect equality of the sex frequencies in Hartman’s data the
multiplier would be $\frac{1}{m_1 m_2}$, where $m_1 = m_2 = \frac{1}{2}$, i.e. 4.0, and
$\chi^2 = 5.9464 \times 4.0000 = 23.7856$. The major part of the work has
been done before the expectation is introduced. Many additional
expectations can be tested without much additional computation.
With complex data this advantage may be very great.
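The separation into an expectation-free part and a multiplier can be sketched in Python as follows, again taking the threshold-2 frequencies as 10 and 10. The names `core`, `contingency_multiplier` and `equality_multiplier` are illustrative assumptions.

```python
# Frequencies for thresholds 10, 9, ..., 1, 0, <0 (Table 49).
men = [15, 35, 46, 31, 23, 13, 9, 7, 10, 13, 25, 63]
women = [42, 52, 38, 30, 19, 17, 6, 5, 10, 19, 33, 43]
total_men, total_women = sum(men), sum(women)
grand_total = total_men + total_women

# Part 1: the quantity proportional to chi-square for any expectation,
# S(a1^2 / n) over the thresholds minus the same quantity from the totals.
core = (sum(a1 * a1 / (a1 + a2) for a1, a2 in zip(men, women))
        - total_men ** 2 / grand_total)

# Part 2: the multiplier, which alone depends on the expectation.
contingency_multiplier = grand_total ** 2 / (total_men * total_women)
equality_multiplier = 1 / (0.5 * 0.5)   # m1 = m2 = 1/2

chi2_contingency = core * contingency_multiplier
chi2_equality = core * equality_multiplier

print(round(core, 4), round(contingency_multiplier, 4))
print(round(chi2_contingency, 4), round(chi2_equality, 4))
```

Only the last two lines of arithmetic change when a different expectation is tried, which is the economy the method is designed to give.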


51. THE GENERAL CONTINGENCY TABLE 

The general contingency table may have any number of rows
and columns. The two marginal or main classifications have no
expected values and the table is used solely to test their independence.
If such marginal expectations did exist the analysis
would take the form

Item             N
Rows             r-1
Columns          c-1
Interaction      (r-1)(c-1)
Total            rc-1

where the table has r rows and c columns, i.e. rc entries in all.

In using the marginal totals as their own expectations, so to
speak, the first two items of the analysis are removed from
account. Their $\chi^2$’s are artificially made to take the value 0 and
so the degrees of freedom to which they correspond are lost.
Only the interaction component, with (r-1)(c-1) degrees of
freedom, remains.

No specially simple method of calculation has yet been
devised for such tables. The expectations for each cell of the
table must be found from the marginal frequencies and the general
formula $\chi^2 = S\left(\frac{a^2}{mn}\right) - n$ used for the calculation.

Example 28. Table 50 gives data obtained by Catcheside
during his analysis of the secondary association of chromosomes
in Brassica oleracea. The pollen mother cells were classified
according to whether they had 3, 2, 1 or 0 pairs of bivalents
showing secondary association at metaphase. Three preparations
were studied, and it was desired to know whether the classification
of the pollen mother cells could be considered as constant from
slide to slide.


TABLE 50

Secondary association in Brassica (Catcheside)

Number of                     Slide
pairs              1             2             3          Total
0                 14             7            11            32
              (13.1039)      (9.9703)      (8.9258)
1                 32            36            35           103
              (42.1780)     (32.0920)     (28.7300)
2                 51            39            32           122
              (49.9585)     (38.0119)     (34.0296)
3                 41            23            16            80
              (32.7596)     (24.9258)     (22.3146)
Total            138           105            94           337

The numbers in brackets are expectations based on the marginal
totals.


The first step in the analysis is to find the frequencies expected
for each pairing category in each slide. Of the 337 pollen mother
cells observed, 32 fall into the 0 class. If the classification is
the same in all slides we should expect a fraction 138/337 of these
32 to be in slide 1, a fraction 105/337 in slide 2 and 94/337 in slide 3.
The expectations for the three cells in the top row of the table
are thus

$$\frac{32 \times 138}{337}, \quad \frac{32 \times 105}{337} \quad \text{and} \quad \frac{32 \times 94}{337}$$

or 13.1039, 9.9703 and 8.9258.

The other expectations are found in the same way. That for
the bottom right-corner cell is, for example, $\frac{80 \times 94}{337}$ or 22.3146.

The formula $\chi^2 = S\left(\frac{a^2}{mn}\right) - n$ is then applied to these figures.
There are twelve items, one from each cell, of the type $\frac{a^2}{mn}$.
That from the top left-hand corner is, for example,

$$\frac{14^2}{13.1039} = 14.9574$$

and that from the bottom right-hand corner is

$$\frac{16^2}{22.3146} = 11.4723.$$

The others are found in just the same way. The sum of the
twelve is found to be 346.9055 and $\chi^2 = 346.9055 - 337 = 9.9055$.
The table has four rows and three columns, so the contingency
$\chi^2$ has (4-1)(3-1) or 6 degrees of freedom. Reference to the
table of $\chi^2$ shows that $\chi^2_{[6]} = 9.9055$ has a probability of between
0.20 and 0.10, so that it may be concluded that the classification
into secondary association types is reasonably consistent over all
three preparations.

The general method of calculation could, of course, have 
been applied to 2x2 and 2xj tables, but it is not so convenient 
to use as the special methods which have been developed for 
these special cases. 
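The general method lends itself to a short Python sketch, shown here for Catcheside’s data with the frequencies of Table 50 entered row by row. The variable names are illustrative assumptions.

```python
# Rows: 0, 1, 2, 3 pairs showing secondary association; columns: slides 1-3.
observed = [
    [14, 7, 11],
    [32, 36, 35],
    [51, 39, 32],
    [41, 23, 16],
]

row_totals = [sum(row) for row in observed]
column_totals = [sum(column) for column in zip(*observed)]
n = sum(row_totals)

# Expectation for each cell from the marginal totals: row total x column total / n.
expected = [[r * c / n for c in column_totals] for r in row_totals]

# chi-square = S(a^2 / mn) - n, where mn is the expected frequency of the cell.
chi2 = sum(
    a * a / e
    for obs_row, exp_row in zip(observed, expected)
    for a, e in zip(obs_row, exp_row)
) - n

degrees_of_freedom = (len(row_totals) - 1) * (len(column_totals) - 1)

print(round(expected[0][0], 4), round(expected[3][2], 4))
print(round(chi2, 4), degrees_of_freedom)
```

The same few lines serve for a table of any size, since nothing in them depends on the number of rows or columns.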


REFERENCES

CATCHESIDE, D. G. 1937. Secondary pairing in Brassica oleracea. Cytologia,
Fujii Jub. Vol., 366-78.

FISHER, R. A. 1944. Statistical Methods for Research Workers. Oliver
and Boyd. Edinburgh, 9th ed.

HARTMAN, G. 1939. Application of individual taste differences towards
phenyl-thio-carbamide in genetic investigations. Ann. Eugen.,
9, 123-35.

IMAI, Y. 1931. Linkage studies in Pharbitis Nil. Genetics, 16, 26-41.

MATHER, K. 1937. The analysis of single factor segregations. Ann.
Eugen., 8, 96-105.

MATHER, K. 1938. The Measurement of Linkage in Heredity. Methuen. London.

PHILP, J. 1933. The genetics of Papaver Rhoeas and related forms.
J. Genet., 28, 175-204.

SIRKS, M. J. 1929. Mendelian factors in Datura III. Genetica, 11,
257-66.


CHAPTER XII
ESTIMATION AND INFORMATION

52. PROBABILITY AND LIKELIHOOD

BROADLY speaking, statistical operations fall into two
categories, viz. tests of significance and the estimation of parameters.
Of these two, we have been mainly occupied up to the
present with tests of significance, though some problems of
estimation, concerning variances and regression coefficients, have
been discussed. It is now necessary to undertake a more detailed
consideration of the problems of estimation.

Tests of significance and the estimation of parameters are 
fundamentally operations of very different characters, though, 
as will be seen later, the success of a test of significance may 
largely depend on the proper estimation of some parameter, 
without which the hypothesis under consideration could not be 
adequately defined. When an hypothesis is adequately stated, 
it is possible to deduce the probabilities of all the types of 
observation which may be encountered in experiment. After 
this has been done, the probability of obtaining a fit as bad or 
worse than that of a given series of observations may be evaluated 
and used as the basis of a decision as to the competence of the 
hypothesis to account for the observations. This is a test of
significance and, in essence, it involves the deduction of particular 
consequences from a general statement. 

Estimation involves the reverse operation, viz. the endeavour 
to construct or amplify a general hypothesis from the material 
afforded by a particular set of observations. This induced 
hypothesis may later be used in a deductive test of significance. 

The essential distinction in the type of argument involved 
may be illustrated by a consideration of genetical recombination. 
If two homozygous individuals who differ by two genes, i.e. have 
the constitutions AABB and aabb, are intercrossed, a double 
heterozygote AB/ab is obtained. Such an individual produces 
four types of gamete, AB, Ab, aB and ab. Of these four types 
two, viz. AB and ab, resemble the gametes which fused to give 
the double heterozygote itself, and may be referred to as non- 
recombinants. The other two types, Ab and aB, are recombin¬ 
ants in that they show changed combinations of the two genes. 
The class, recombination or non-recombination, into which any 
gamete falls may be discovered by the use of suitable test crosses. 
The relative proportions of the two types of gamete are denoted 
as p : 1-p, p being termed the recombination value. 



Let us suppose that, by means of suitable crosses, n gametes 
have been tested and found to consist of $a_1$ recombinants and 
$a_2$ non-recombinants. The probability of obtaining such a result 
is, as shown in Section 5, 

$$\frac{n!}{a_1!\,a_2!}\,p^{a_1}(1-p)^{a_2}.$$

When p is fixed by hypothesis, as it would be, for example, if 
the genes were supposedly independent in inheritance, so giving 
$p=\tfrac{1}{2}$, all the possible types of family, with $a_1$=0, 1, 2, &c., and 
$a_2$=n, n-1, n-2, &c., could be enumerated, and their probabilities 
calculated for use as the basis of a test of significance. 
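
The enumeration described here can be sketched in a few lines of Python. This is only an illustration: the family size of 10 gametes and the function name `family_probabilities` are assumptions made for the example and do not come from the text.

```python
from math import comb

def family_probabilities(n, p):
    """Probability of every possible family of n gametes:
    a1 recombinants and n - a1 non-recombinants, for a1 = 0..n."""
    return [comb(n, a1) * p**a1 * (1 - p)**(n - a1) for a1 in range(n + 1)]

# Hypothetical family of n = 10 gametes under independence (p = 1/2).
probs = family_probabilities(10, 0.5)
total = sum(probs)  # the enumerated probabilities account for every family
```

Because p is fixed, the list covers every possible observation and its entries sum to 1, which is what allows a tail probability to be formed for a test of significance.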

The two genes might not, however, be independent in 
inheritance, in which case p could take any value between wide 
limits. It would then be necessary to find the value of p before 
the hypothesis of non-independence, or linkage, became sufficiently 
precise for further use. We are thus faced with the 
problem of estimating p from a knowledge of the value of 

$$\frac{n!}{a_1!\,a_2!}\,p^{a_1}(1-p)^{a_2}.$$

Superficially this is just the reverse calculation to the previous 
one, but a closer inspection shows that such a simple statement 
cannot be true. In each case there is a single starting-point, 
viz. a single hypothetical value of p or a single observed value of 
$\frac{n!}{a_1!\,a_2!}\,p^{a_1}(1-p)^{a_2}$; but whereas, when p is fixed, it is possible 
to enumerate all the types of family and determine the relative 
probability of any one observation or set of observations, it is 
not possible, when a single observation has been made, to 
enumerate all the hypothetical values of p to which it might be 
related. Given the hypothesis, the scope and limits of observation 
are determined, but given the observation the scope and 
limits of hypothesis are not determined. 

Unless this difficulty can be resolved, estimation cannot be 
treated as an exercise in probability. Thomas Bayes devised 
an axiom which, if its truth be granted, would supply the necessary 
basis for such a treatment. Bayes proposed to treat the hypothetical 
population from which the observed sample was drawn 
as itself a sample of a super-population and to assign a priori 
probabilities to the constituents of this super-population. This 
leads to the so-called inverse probability method of approach. 
The difficulties of the method need not be enumerated here, but 
it may be noted that Bayes himself apparently doubted its value 
and many writers have wholly rejected it since his time. 

The situation has now been completely transformed by Fisher's 
analysis of the question of estimation. He recognizes and accepts 
the differences between the calculation of probabilities from 
hypothesis and the estimation of parameters from observation. 

Mathematically the expression $\frac{n!}{a_1!\,a_2!}\,p^{a_1}(1-p)^{a_2}$ is a function 
both of the observed number a, and of the hypothetical frequency 
p. Regarded as a function of a, this expression is the 
probability with which the number a will be observed. Regarded 
as a function of p, it is not a probability. Fisher terms it a 
likelihood function. Probability and likelihood have distinct 
mathematical properties and the special properties of likelihood 
have allowed Fisher to develop a theory of estimation which is 
independent of the theory of probability. 
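
One way to see that the same expression behaves differently in its two arguments is to total it over each in turn. The sketch below is illustrative only; the values n = 10, p = 0.3 and a = 3 are arbitrary assumptions, and the integration over p is a simple midpoint sum.

```python
from math import comb

def term(n, a1, p):
    """The binomial term n!/(a1! a2!) p^a1 (1-p)^a2 with a2 = n - a1."""
    return comb(n, a1) * p**a1 * (1 - p)**(n - a1)

n = 10

# p fixed, a1 varying: the terms form a probability distribution.
sum_over_a = sum(term(n, a1, 0.3) for a1 in range(n + 1))

# a1 fixed, p varying: the same expression viewed as a likelihood.
# Its area over 0 <= p <= 1 (midpoint rule) is 1/(n + 1), not 1.
steps = 10_000
area_over_p = sum(term(n, 3, (k + 0.5) / steps) for k in range(steps)) / steps
```

Summed over the observations the expression totals 1; taken over the hypothetical values of p it does not, which is one sign that likelihoods cannot be handled as probabilities.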

The most important of these properties of likelihood is that 
relating to the precision of statistics to which the likelihood 
function leads. If from a body of data T be obtained as an 
estimate of the parameter θ, and if in large samples T is distributed 
normally with variance $V_T$, then the limiting value of 
$\frac{1}{nV_T}$, as n becomes large, cannot exceed a quantity i which is 
defined independently of the estimation of T. The proof of this 
proposition may be found in an article on the 'Statistical Theory 
of Estimation' by Fisher and need not be set out here. The 
important result for our purpose is that i can always be found 
from the formula 

$$i=S\left[\frac{1}{m}\left(\frac{dm}{d\theta}\right)^2\right]$$

where m is the class expectation in terms of the parameter θ and 
S, as usual, denotes summation over all classes. This quantity 
i is an intrinsic property of the data and provides the necessary 
yardstick with which to compare the value of any estimate 
obtained from the observations. 

Fisher has also shown that one method of estimation, viz. 
that of maximizing the likelihood function by adjustment of the 
parameter value, always provides an estimate, T, which has the 
property that the limiting value of $\frac{1}{nV_T}$ is i. In other words, 
the reciprocal of the variance of the maximum likelihood estimate 
supplies us with a means of assessing the value of other estimates. 
Unless their variances equal $\frac{1}{ni}$ they are not extracting the full 
amount of information available in the data, and are hence less 
efficient than the maximum likelihood statistic. Such comparisons 
of accuracy may be made in the special case of the 
distribution of estimates of the same parameter even when the 
error curves are not normal. 


53. THE METHOD OF MAXIMUM LIKELIHOOD 

Consider a set of n individuals separable into j classes, the 
observed frequencies being $a_1, a_2, a_3 \ldots a_j$, and S(a)=n. If 
the proportions expected in the various classes are $m_1, m_2, 
m_3 \ldots m_j$, with S(m)=1, m being a function of the parameter θ, 
then the likelihood of obtaining such a family as the one observed 
will be given by the appropriate term in the expansion of the 
multinomial 

$$(m_1+m_2+m_3+\ldots+m_j)^n.$$

This term is clearly 

$$\frac{n!}{a_1!\,a_2!\,a_3!\ldots a_j!}\,m_1^{a_1}m_2^{a_2}m_3^{a_3}\ldots m_j^{a_j}.$$

Estimation of θ by the method of maximum likelihood 
requires that a value T be found which, on substituting for θ, 
makes this expression a maximum. Maximization is carried out 
by differentiation with respect to θ and equating the differential 
to 0 in order to obtain the equation of estimation. Such an 
expression as that given above is, however, not very easy to 
differentiate and resort is made to a device for this purpose. 
The expression and its logarithm will both show maxima at the 
same value of T. Hence if we find that value of T which 
maximizes the logarithm of the likelihood we have solved our 
problem. 

The log likelihood is 

$$L=\log\left(\frac{n!}{a_1!\,a_2!\,a_3!\ldots a_j!}\right)+a_1\log m_1+a_2\log m_2+\ldots+a_j\log m_j.$$

Maximizing this logarithm gives 

$$\frac{dL}{d\theta}=a_1\frac{d\log m_1}{d\theta}+a_2\frac{d\log m_2}{d\theta}+\ldots+a_j\frac{d\log m_j}{d\theta}=0$$

as $\log\left(\frac{n!}{a_1!\,a_2!\,a_3!\ldots a_j!}\right)$ is independent of θ and vanishes when 
differentiated. 

The appropriate root of this equation supplies the necessary 
estimate, T, of θ. 

Example 29. As an example of the application of this method 
consider the estimation of the parameter in the Poisson distribution, 
already mentioned in Section 12. The variate x may take 
the values 0, 1, 2, 3 ... j and the frequencies with which observations 
are expected to fall into these classes are 

$$e^{-\mu},\quad e^{-\mu}\mu,\quad e^{-\mu}\frac{\mu^2}{2!},\quad\ldots\quad e^{-\mu}\frac{\mu^j}{j!}.$$

If the observed frequencies are $a_0$, $a_1$, &c., the likelihood 
function is 

$$\frac{n!}{a_0!\,a_1!\,a_2!\ldots}\,(e^{-\mu})^{a_0}(e^{-\mu}\mu)^{a_1}\left(e^{-\mu}\frac{\mu^2}{2!}\right)^{a_2}\ldots$$

The log likelihood then becomes 

$$L=C+a_0\log(e^{-\mu})+a_1\log(e^{-\mu}\mu)+a_2\log\left(e^{-\mu}\frac{\mu^2}{2!}\right)+\ldots$$

where C is the constant which vanishes on maximization. 

This log likelihood expression may be recast in a more manageable 
form because 

$$\log(e^{-\mu})=-\mu\quad\text{and}\quad\log\left(e^{-\mu}\frac{\mu^2}{2!}\right)=\log(\mu^2)-\log(2!)-\mu,\ \text{\&c.}$$

Rewriting gives 

$$L=C+a_1\log\mu+a_2\log(\mu^2)+a_3\log(\mu^3)\ldots-a_2\log(2!)-a_3\log(3!)\ldots-a_0\mu-a_1\mu-a_2\mu-a_3\mu\ldots$$

It may be noted that log 2!, &c., are independent of μ and will 
vanish on differentiating. Furthermore, as S(a)=n, the set of 
terms in μ may be replaced by a single term -nμ. Maximization 
then gives 

$$\frac{dL}{d\mu}=\frac{a_1}{\mu}+\frac{2a_2}{\mu}+\frac{3a_3}{\mu}+\ldots-n=0$$

which reduces to $a_1+2a_2+3a_3+\ldots=n\mu$. 

Then 

$$\mu=\frac{a_1+2a_2+3a_3\ldots}{n}.$$

Now $a_1$ is the observed frequency of the class in which x=1, 
$a_2$ that in which x=2, and so on. So 

$$S(x)=a_1+2a_2+3a_3\ldots$$

and μ is estimated by $\bar{x}$, the mean of x, as in Section 12. 
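
The result can be checked numerically. In the sketch below the class frequencies are hypothetical values chosen only for illustration; the function `score` is the left side of the maximization equation as derived above.

```python
# Hypothetical observed frequencies a_0, a_1, a_2, a_3 for x = 0, 1, 2, 3.
a = [10, 20, 15, 5]
n = sum(a)

# S(x) = a_1 + 2 a_2 + 3 a_3 ..., the sum of all observed x values.
sum_x = sum(x * a_x for x, a_x in enumerate(a))

# Maximum likelihood estimate of the Poisson parameter: the mean of x.
mu_hat = sum_x / n

def score(mu):
    """dL/dmu = (a_1 + 2 a_2 + 3 a_3 ...)/mu - n."""
    return sum_x / mu - n
```

The score is positive for trial values below the mean, negative above it, and vanishes at `mu_hat`, so the mean is the root of the equation of estimation.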

Example 30. A numerical example of this type of calculation 
is afforded by Catcheside's data on the secondary association of 
chromosomes in Brassica oleracea. There are three pairs of 
bivalents, each of which may or may not show association at 
meiosis. In this way a meiotic nucleus may contain 0, 1, 2 or 
3 pairs of associated bivalents. The frequencies with which nuclei 
of these types were observed in certain preparations were 32, 
103, 122 and 80 respectively (Table 50). If all three pairs have 
the same chance, p, of being associated, the expected frequencies 
of the four classes will be 

$$(1-p)^3;\quad 3p(1-p)^2;\quad 3p^2(1-p);\quad p^3.$$

What is the value of p which gives the best fit with the observed 
results? 






The log likelihood is clearly 

$$L=C+32\log[(1-p)^3]+103\log[3p(1-p)^2]+122\log[3p^2(1-p)]+80\log[p^3]$$

This expression may be rearranged as shown in Table 51. 


TABLE 51 

Item                      log p    log (1-p)    log 3 
32 log [(1-p)^3]            -         96          - 
103 log [3p(1-p)^2]        103       206         103 
122 log [3p^2(1-p)]        244       122         122 
80 log [p^3]               240        -           - 
Total                      587       424         225 


This gives 

$$L=C+587\log p+424\log(1-p)+225\log 3$$

and 

$$\frac{dL}{dp}=\frac{587}{p}-\frac{424}{1-p}=0.$$

Then 

$$p=\frac{587}{424+587}=\frac{587}{1{,}011}=0.580613$$
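
The collection of terms in Table 51 and the direct solution can be reproduced as follows. The variable names are illustrative assumptions; the counts are those quoted for Catcheside's data.

```python
# Observed numbers of nuclei with 0, 1, 2 and 3 associated pairs.
counts = [32, 103, 122, 80]

# A nucleus with k associated pairs contributes p^k (1-p)^(3-k), so the
# coefficients of log p and log (1-p) are weighted sums of the counts.
log_p_coeff = sum(k * a for k, a in enumerate(counts))
log_q_coeff = sum((3 - k) * a for k, a in enumerate(counts))

# Root of 587/p - 424/(1-p) = 0.
p_hat = log_p_coeff / (log_p_coeff + log_q_coeff)
```

The denominator is 3n = 1,011, in keeping with the view that each nucleus supplies three independent observations on p.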


This method of solution would always be used in practice, 
but if the simplification of the log likelihood expression is omitted, 
the data will serve to illustrate the process of solution by 
iteration. This latter is a very valuable method of solving 
complicated estimation equations when no algebraic means exists. 

We have 

$$L=32\log[(1-p)^3]+103\log[3p(1-p)^2]+122\log[3p^2(1-p)]+80\log[p^3]$$

and 

$$\frac{dL}{dp}=-\frac{32(3-6p+3p^2)}{(1-p)^3}+\frac{103(1-4p+3p^2)}{p(1-p)^2}+\frac{122(2p-3p^2)}{p^2(1-p)}+\frac{80(3p^2)}{p^3}=0$$

which obviously simplifies to 

$$-\frac{96}{1-p}+\frac{103(1-3p)}{p(1-p)}+\frac{122(2-3p)}{p(1-p)}+\frac{240}{p}=0.$$

This may be further simplified to the form used earlier, but it 
will be instructive to solve it by iteration as it stands. This 
process consists in substituting trial values of p, finding the 
resulting values of the maximum likelihood expression and 
interpolating for p on the basis of these values. The details 
are shown in Table 52. 





TABLE 52 

The Iteration Solution of the Equation 

$$-\frac{96}{1-p}+\frac{103(1-3p)}{p(1-p)}+\frac{122(2-3p)}{p(1-p)}+\frac{240}{p}=0$$

p                         0.5         0.6         0.58        0.581       0.5806 
-96/(1-p)              -192.0000   -240.0000   -228.5714   -229.1169   -228.8984 
103(1-3p)/[p(1-p)]     -206.0000   -343.3333   -312.8900   -314.3662   -313.7752 
122(2-3p)/[p(1-p)]      244.0000    101.6667    130.2135    128.7961    129.3632 
240/p                   480.0000    400.0000    413.7931    413.0809    413.3655 
Total                   326.0000    -81.6667      2.5452     -1.6061      0.0551 
Difference                    407.6667                  4.1513 


As a first approximation we put p=0.5 and calculate the 
value of the left side of the maximum likelihood equation. The 
value is positive, and so the trial value for p was too low. The 
calculation is then repeated with p=0.6, which is found to be too 
high a value, as the remainder is negative. The next step is to 
make a linear interpolation putting 

$$p=0.5+\frac{0.1\times326.0000}{326.0000+81.6667}=0.58.$$

This value for p is a close approximation as it gives a very 
small remainder. It is, however, slightly too low as the remainder 
is positive. In view of the proximity of the true value of p to 
0.58 it is not worth trying p=0.59 and interpolating from the 
difference. Instead we try p=0.581 and find that this value is 
a shade too high. Interpolation then gives 

$$p=0.580+\frac{0.001\times2.5452}{2.5452+1.6061}=0.5806$$

which, as the last column of the table shows, is a very close 
approximation. 

The alternate trial and interpolation can be carried on to give 
as accurate a value for p as may be desired. Four-figure accuracy 
will serve our present purpose of illustrating the method. 
Solution by iteration has another great advantage in the calculation 
of the variance of p, as will be seen below. 
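
The trial-and-interpolation procedure can be sketched in code. The function names `score` and `interpolate` are assumptions made for this illustration; `score` is the unsimplified maximum likelihood expression for Catcheside's data and `interpolate` is the linear interpolation between a trial value that is too low and one that is too high.

```python
def score(p):
    """Left side of the unsimplified maximum likelihood equation."""
    return (-96 / (1 - p)
            + 103 * (1 - 3 * p) / (p * (1 - p))
            + 122 * (2 - 3 * p) / (p * (1 - p))
            + 240 / p)

def interpolate(p_low, p_high):
    """Linear interpolation between two trial values of p whose
    remainders are of opposite sign."""
    r_low, r_high = score(p_low), score(p_high)
    return p_low + (p_high - p_low) * r_low / (r_low - r_high)

first = interpolate(0.5, 0.6)        # close to 0.58
second = interpolate(0.580, 0.581)   # close to 0.5806
```

Each interpolation narrows the bracket, so the pair of trial values can be refined until the remainder is as small as required.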

It has been mentioned that the reciprocal of the variance of 
a maximum likelihood statistic is equal to ni, where i is the 
amount of information extractable from a single observation of 
the data, irrespective of how the estimation is made. For 
convenience we may put I=ni and refer to I as the amount of 
information contained in the body of data at hand. 

Fisher has shown that 

$$i=S\left[\frac{1}{m}\left(\frac{dm}{d\theta}\right)^2\right].$$

This formula may be used to obtain the variance of p from 
Catcheside's figures. Table 53 shows how it is applied. 


TABLE 53 

The Calculation of an Amount of Information 

Class    Expectation (m)    dm/dp            (1/m)(dm/dp)^2 
0        (1-p)^3            -3(1-p)^2        9(1-p) 
1        3p(1-p)^2          3(1-p)(1-3p)     3(1-3p)^2/p 
2        3p^2(1-p)          3p(2-3p)         3(2-3p)^2/(1-p) 
3        p^3                3p^2             9p 
Total    1                  0                3/[p(1-p)] 

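Thus 

$$i_p=\frac{3}{p(1-p)},\qquad I_p=ni_p=\frac{3n}{p(1-p)}\qquad\text{and}\qquad V_p=\frac{1}{I_p}=\frac{p(1-p)}{3n}.$$

This result is not surprising. It will be recalled from Section 13 that the 
variance of p in a binomial expansion is given by $\frac{p(1-p)}{n}$. Since, 
however, every observation of the present data covers the 
behaviour of three pairs, each of which supplies an independent 
piece of information about p, we have in effect 3n observations. 
So $V_p$ might be expected to take the value $\frac{p(1-p)}{3n}$. 

Substituting the estimated value of p, viz. 0.5806, and the 
observed value of n, viz. 337, we find 

$$I_p=\frac{3\times337}{0.5806\times0.4194}=4{,}151.889$$

and 

$$V_p=\frac{1}{I_p}=0.00024085.$$

Then 

$$s_p=\sqrt{V_p}=0.015519$$

and we may write 

$$p=0.5806\pm0.01552.$$

The formula 

$$i_p=S\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]$$

has an identical form, viz. 

$$i_p=-S\left[m\,\frac{d^2\log m}{dp^2}\right].$$

For 

$$\frac{d\log m}{dp}=\frac{1}{m}\frac{dm}{dp}$$

and 

$$\frac{d^2\log m}{dp^2}=\frac{d}{dp}\left(\frac{1}{m}\frac{dm}{dp}\right)=\frac{1}{m}\frac{d^2m}{dp^2}-\frac{1}{m^2}\frac{dm}{dp}\frac{dm}{dp}$$

so that 

$$-S\left[m\,\frac{d^2\log m}{dp^2}\right]=-S\left[\frac{d^2m}{dp^2}-\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]=-S\left(\frac{d^2m}{dp^2}\right)+S\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right].$$

Now S(m)=1 and so 

$$S\left(\frac{dm}{dp}\right)=0\qquad\text{and}\qquad S\left(\frac{d^2m}{dp^2}\right)=0.$$

Hence the two formulae for i are identical. 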
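
As a numerical check on the identity, both forms of i can be evaluated for the four class expectations of Catcheside's data. The sketch below is an illustration only: the function names are assumptions, and the derivatives are approximated by finite differences rather than worked algebraically.

```python
from math import log, sqrt

def expectations(p):
    """Class expectations for 0, 1, 2 and 3 associated pairs."""
    q = 1 - p
    return [q**3, 3 * p * q**2, 3 * p**2 * q, p**3]

def info_first_form(p, h=1e-5):
    """i = S[(1/m)(dm/dp)^2], with dm/dp taken by central differences."""
    total = 0.0
    for k, m in enumerate(expectations(p)):
        dm = (expectations(p + h)[k] - expectations(p - h)[k]) / (2 * h)
        total += dm**2 / m
    return total

def info_second_form(p, h=1e-4):
    """i = -S[m d^2 log m / dp^2], by second central differences."""
    total = 0.0
    for k, m in enumerate(expectations(p)):
        d2 = (log(expectations(p + h)[k]) - 2 * log(m)
              + log(expectations(p - h)[k])) / h**2
        total -= m * d2
    return total

n, p_hat = 337, 0.5806
i_closed = 3 / (p_hat * (1 - p_hat))   # closed form from Table 53
I_p = n * i_closed                     # about 4,151.889
V_p = 1 / I_p                          # about 0.00024085
s_p = sqrt(V_p)                        # about 0.01552
```

Both numerical forms agree with the closed form 3/[p(1-p)] to within the error of the finite differences.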

This new formula for i is of value because it is easily related 
to the maximum likelihood expression. The latter is 

$$\frac{dL}{dp}=S\left(a\,\frac{d\log m}{dp}\right).$$

Hence, if we redifferentiate and substitute mn, the expected 
value in any class, for the corresponding observed value, we have 

$$\frac{d^2L}{dp^2}=nS\left(m\,\frac{d^2\log m}{dp^2}\right)=-ni_p=-I_p.$$

This can be applied to the unsimplified form of the maximum 
likelihood expression from Catcheside's data, which was 

$$\frac{dL}{dp}=-\frac{32\times3}{1-p}+\frac{103(1-3p)}{p(1-p)}+\frac{122(2-3p)}{p(1-p)}+\frac{80\times3}{p}$$

where 32=$a_0$, 103=$a_1$, 122=$a_2$ and 80=$a_3$. Then on redifferentiating 

$$-I_p=nS\left(m\,\frac{d^2\log m}{dp^2}\right)=\frac{d^2L}{dp^2}=-\left\{\frac{3a_0}{(1-p)^2}+\frac{a_1[3p(1-p)+(1-3p)(1-2p)]}{p^2(1-p)^2}+\frac{a_2[3p(1-p)+(2-3p)(1-2p)]}{p^2(1-p)^2}+\frac{3a_3}{p^2}\right\}.$$

Substituting mn for a, simplifying and changing the sign 

$$I_p=3n(1-p)+\frac{3n}{p}(1-2p+3p^2)+\frac{3n}{1-p}(2-4p+3p^2)+3np=\frac{3n}{p(1-p)}$$

as before. 

This formula relating $I_p$ to the maximum likelihood expression 
is also useful in conjunction with the solution of maximum 
likelihood equations by iteration. $I_p$ is, with sign changed, the second differential 
of the log likelihood expression, i.e. is the rate of change on p of 
the first differential, which we term the maximum likelihood 
expression, though always, of course, with mn in place of a. 
The rate of change of the actual maximum likelihood expression 
is $S\left(a\,\frac{d^2\log m}{dp^2}\right)$, which equals $-I_p$ provided that we are willing 
to accept the observed value, a, in any class as a substitute for 
its expectation, mn. 

Now we can see from Table 52 that when p=0.580, 

$$\frac{dL}{dp}=S\left(a\,\frac{d\log m}{dp}\right)=2.5452$$

and when p=0.581, $\frac{dL}{dp}=-1.6061$. Thus a change of 
0.001 in p changes the maximum likelihood expression by 4.1513. 
In other words, the rate of change of the maximum likelihood 
expression on p is, sign apart, $\frac{4.1513}{0.001}=4{,}151.3$. This is the value of $I_p$. 

The calculation of $I_p$ by the direct method gave its value as 
4,151.889, while derivation from the iteration solution gives it 
as 4,151.3. The difference is due to the use, in the latter method, 
of a in place of mn. This can be expressed by saying that the 
value obtained by the direct method is the mean amount of 
information expected from a body of data of this type and size; 
but the value obtained from the iteration solution is the actual 
amount of information yielded by the particular body of data 
under examination. Either value may be used to obtain the 
variance and standard error of p. 
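
The two values of the information can be set side by side in a short computation. This is a sketch under the same assumptions as the iteration itself: `score` is the unsimplified maximum likelihood expression for Catcheside's data, and the slope is taken over the interval 0.580 to 0.581 used in the text.

```python
def score(p):
    """Maximum likelihood expression dL/dp for Catcheside's data."""
    return (-96 / (1 - p)
            + 103 * (1 - 3 * p) / (p * (1 - p))
            + 122 * (2 - 3 * p) / (p * (1 - p))
            + 240 / p)

# Actual information: fall in the score per unit change of p.
observed_information = (score(0.580) - score(0.581)) / 0.001

# Mean (expected) information from the direct formula 3n / [p(1-p)].
expected_information = 3 * 337 / (0.5806 * (1 - 0.5806))
```

The two figures differ by well under one part in a thousand here, so either leads to practically the same standard error.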


54. INEFFICIENT STATISTICS 

The method of maximum likelihood holds a unique position 
in the theory of estimation by virtue of the fact that it will 
always extract from the data the maximum amount of available 
information. Its limiting variance is equal to the reciprocal of 
ni, where i is the amount of information defined independently 
of the method of estimation. This is commonly expressed by 
saying that the method of maximum likelihood always leads to 
an efficient statistic. With an infinitely large sample all the 
information is extracted by an efficient statistic, but in a finite 
sample this is generally not the case. Special methods are then 
necessary for the extraction of the residuum of information not 
available by the ordinary estimation process. Sometimes, however, 
a statistic does extract all the information that the data 
contain, even in small samples. Such a statistic is termed 
sufficient, and will be found by the method of maximum likelihood 
where it exists. 

In many types of problem other methods of estimation 
suggest themselves and are in some cases fully efficient, in the 
sense that they lead to a statistic having the same variance as 
that yielded by the method of maximum likelihood. Yet no 
other method of estimation has the property of always leading 
to an efficient statistic. Hence an inefficient statistic may easily 
result from the use of some other estimation process. As 
maximum likelihood is frequently not the easiest method to use 
it is worth examining the drawbacks of inefficient estimates. It 
is only by this means that the desirability of always using efficient 
methods, even at the expense of extra labour, will be appreciated. 
It is assumed that all the methods considered lead to consistent 
statistics. A consistent statistic is simply one which approaches 
more and more closely to the true value of the parameter as the 
size of sample increases. It will be readily appreciated that 
inconsistent statistics are utterly misleading and should not 
be used under any circumstances. 

Example 31. De Winton and Haldane have recorded the 
results of self-pollinating and intercrossing Primula sinensis 
plants that were heterozygous for the two genes F,f and Ch,ch. 
These genes are linked and the 4,164 individuals observed in the 
progeny of coupled double heterozygotes showed the following 
segregation: 2,972 F Ch, 171 F ch, 190 f Ch, 831 f ch. What is 
the linkage value of the two genes? 

Now it can be shown from simple genetical considerations 
that the frequencies expected in each of the four classes are 
$\tfrac{1}{4}(2+P)$ F Ch, $\tfrac{1}{4}(1-P)$ F ch, $\tfrac{1}{4}(1-P)$ f Ch, $\tfrac{1}{4}P$ f ch, where 
$P=(1-p_m)(1-p_f)$, 
$p_m$ being the recombination value of the male gametes and 
$p_f$ that of the female gametes. It should be noted that $p_m$ and 
$p_f$ cannot be separated in data of this kind; only if they are 
assumed to be equal can the value of p be found from that of P. 
We will consider the estimation of P as the parameter characterizing 
the data. 





The log likelihood expression is 

$$L=2{,}972\log\left(\frac{2+P}{4}\right)+171\log\left(\frac{1-P}{4}\right)+190\log\left(\frac{1-P}{4}\right)+831\log\left(\frac{P}{4}\right)$$

which may be re-written as 

$$L=2{,}972\log(2+P)+361\log(1-P)+831\log P-4{,}164\log 4.$$

Then the maximum likelihood equation of estimation becomes 

$$\frac{dL}{dP}=\frac{2{,}972}{2+P}-\frac{361}{1-P}+\frac{831}{P}=0$$

or 

$$1{,}662+1{,}419P-4{,}164P^2=0$$

and 

$$P=0.824734$$
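
The quadratic can be formed and solved directly from the four class frequencies. The names `b` and `c` below are assumptions introduced for the sketch; they are the coefficients that arise when the equation of estimation is cleared of fractions.

```python
from math import sqrt

# De Winton and Haldane's segregation: F Ch, F ch, f Ch, f ch.
a1, a2, a3, a4 = 2972, 171, 190, 831
n = a1 + a2 + a3 + a4

# Clearing fractions in a1/(2+P) - (a2+a3)/(1-P) + a4/P = 0 gives
# n P^2 - b P - c = 0 with the coefficients below.
b = a1 - 2 * (a2 + a3) - a4
c = 2 * a4

# Only the positive root is admissible as a value of P.
P_hat = (b + sqrt(b * b + 4 * n * c)) / (2 * n)

def score(P):
    """dL/dP for the four-class F2 data."""
    return a1 / (2 + P) - (a2 + a3) / (1 - P) + a4 / P
```

Substituting `P_hat` back into `score` gives a value that is zero to within rounding, confirming the root.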


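Redifferentiating and substituting expected for observed frequencies 
gives 

$$I_P=-\frac{d^2L}{dP^2}=\frac{n}{4}\left[\frac{2+P}{(2+P)^2}+\frac{2(1-P)}{(1-P)^2}+\frac{P}{P^2}\right]=\frac{n}{4}\left(\frac{1}{2+P}+\frac{2}{1-P}+\frac{1}{P}\right)$$

and 

$$I_P=\frac{n(1+2P)}{2P(1-P)(2+P)}.$$

Substituting the estimated value of P 

$$I_P=\frac{4{,}164\times2.649469}{2\times0.824734\times0.175266\times2.824734}=13{,}509.8703$$

and 

$$V_P=\frac{1}{I_P}=0.000074020.$$

We may note that if it is assumed that male and female 
recombination values are the same 

$$P=(1-p)^2=0.824734$$

$$1-p=0.908149$$

$$p=0.091851$$

In order to arrive at the values of $I_p$ and $V_p$ we put 

$$I_p=nS\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]$$

but 

$$\frac{dm}{dp}=\frac{dm}{dP}\cdot\frac{dP}{dp}$$

and so 

$$I_p=nS\left[\frac{1}{m}\left(\frac{dm}{dP}\right)^2\left(\frac{dP}{dp}\right)^2\right]$$

which, since $\frac{dP}{dp}$ is constant for all the classes in the data, 

$$=nS\left[\frac{1}{m}\left(\frac{dm}{dP}\right)^2\right]\left(\frac{dP}{dp}\right)^2=I_P\left(\frac{dP}{dp}\right)^2.$$

Now 

$$P=(1-p)^2\quad\text{so}\quad\frac{dP}{dp}=-2(1-p).$$

Hence 

$$I_p=4(1-p)^2I_P=4PI_P$$

and 

$$V_p=\frac{1}{I_p}=0.0000224375$$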

These estimates of P and p are fully efficient, but other 
estimates do not always have this property. One such estimate 
is related to the calculation of $\chi^2$ testing the linkage item in 
this family. 

In Section 46 we used the comparison 

$$(a_1-3a_2-3a_3+9a_4)$$

to test for linkage. If this comparison can be used for the 
detection of linkage it seems not unreasonable to expect that it 
would be suitable for estimating the linkage value. 

Substituting the expected for the observed values the 
comparison becomes 

$$\frac{n}{4}\left(2+P_1-3(1-P_1)-3(1-P_1)+9P_1\right)=n(4P_1-1)$$

The symbol $P_1$ is used to distinguish the present estimate from 
the maximum likelihood statistic which has already been obtained. 

The comparisons having the observed and expected values 
can be equated to give the equation of estimation of $P_1$ 

$$n(4P_1-1)=(a_1-3a_2-3a_3+9a_4)$$

$$4nP_1=(a_1-3a_2-3a_3+9a_4)+(a_1+a_2+a_3+a_4)$$

and 

$$P_1=\frac{1}{2n}(a_1-a_2-a_3+5a_4)$$

Taking de Winton and Haldane's data 

$$P_1=\frac{1}{2\times4{,}164}[2{,}972-171-190+(5\times831)]=0.812439$$

This value is distinctly lower than that of the maximum likelihood 
estimate. 





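The variance of $P_1$ can be found by means of the general 
formula, taken from Fisher, 

$$V_T=S\left[mn\left(\frac{dT}{da}\right)^2\right]-n\left(\frac{dT}{dn}\right)^2.$$

Now 

$$\frac{dP_1}{da_1}=\frac{1}{2n},\qquad\frac{dP_1}{da_2}=\frac{dP_1}{da_3}=-\frac{1}{2n},\qquad\frac{dP_1}{da_4}=\frac{5}{2n}$$

and 

$$\frac{dP_1}{dn}=-\frac{1}{2n^2}(a_1-a_2-a_3+5a_4)=-\frac{P_1}{n}.$$

So 

$$V_{P_1}=\frac{n(2+P_1)}{4}\left(\frac{1}{2n}\right)^2+\frac{n(1-P_1)}{4}\left(-\frac{1}{2n}\right)^2+\frac{n(1-P_1)}{4}\left(-\frac{1}{2n}\right)^2+\frac{nP_1}{4}\left(\frac{5}{2n}\right)^2-n\left(\frac{P_1}{n}\right)^2$$

$$=\frac{1}{16n}(2+P_1+1-P_1+1-P_1+25P_1-16P_1^2)$$

and 

$$V_{P_1}=\frac{1}{4n}(1+6P_1-4P_1^2).$$

Substituting the value observed for $P_1$ gives 

$$V_{P_1}=\frac{1+(6\times0.812439)-(4\times0.812439^2)}{4\times4{,}164}=0.00019419.$$

This is to be compared with the variance of the maximum 
likelihood statistic, $V_P$=0.000074020. 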

$V_{P_1}$ is much greater than $V_P$. In other words, the second 
method of estimation gives a less precise estimate of the linkage 
value. As Fisher has shown, we can regard $I_{P_1}=\frac{1}{V_{P_1}}=5{,}149.6334$ 
as the amount of information extracted by the second method 
of estimation. The available information is measured by $I_P$ and 
has already been found as 13,509.8703, so that the estimate 
$P_1$ uses only $\frac{5{,}149.6334}{13{,}509.8703}$ or 38% of the total information utilizable 
by an efficient estimate. This result is often stated by saying 
that the efficiency, E, of $P_1$ is 

$$E=\frac{I_{P_1}}{I_P}=38\%$$

An equally precise result would have been obtained if the 
method of maximum likelihood had been employed with 
$4{,}164\times\frac{38}{100}=1{,}582$ plants in the family. Inefficient estimation 
is thus very wasteful. In the present case its use is equivalent 
to throwing away nearly two-thirds of the labour and materials 
used in raising the family. 
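
The figures in this comparison can be recomputed from the class frequencies. In the sketch below the value of $I_P$ is taken as quoted in the text rather than recalculated, and the variable names are assumptions made for the illustration.

```python
# De Winton and Haldane's segregation: F Ch, F ch, f Ch, f ch.
a1, a2, a3, a4 = 2972, 171, 190, 831
n = a1 + a2 + a3 + a4

# Inefficient estimate obtained from the linkage comparison.
P1 = (a1 - a2 - a3 + 5 * a4) / (2 * n)

# Its variance and the information it extracts.
V_P1 = (1 + 6 * P1 - 4 * P1**2) / (4 * n)
I_P1 = 1 / V_P1

# Information available to the maximum likelihood estimate (from the text).
I_P = 13509.8703

efficiency = I_P1 / I_P
# Size of family giving the same precision under maximum likelihood,
# using the rounded efficiency of 38% as in the text.
equivalent_plants = round(n * 0.38)
```

Using the unrounded efficiency in place of 38% gives a figure of about 1,587 plants, so the conclusion about the waste of material is unaffected by the rounding.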

The value of E is itself dependent on the value of P. 

$$I_P=\frac{n(1+2P)}{2P(1-P)(2+P)}\qquad\text{and}\qquad I_{P_1}=\frac{4n}{1+6P_1-4P_1^2}$$

Now if $P=P_1$, as must be the case on the average, since both 
are consistent estimates, 

$$E=\frac{I_{P_1}}{I_P}=\frac{8P(1-P)(2+P)}{(1+6P-4P^2)(1+2P)}$$

Fig. 8 shows the value of E plotted against the value of P. 

FIG. 8 
The efficiency, for all linkage values, of the estimate $P_1=\frac{1}{2n}(a_1-a_2-a_3+5a_4)$, 
by comparison with the maximum likelihood statistic 

It will be seen that E=1, i.e. the second method of estimation 
is fully efficient, only when P=0.25. At this point p=0.5, i.e. 
the genes are independent in inheritance. So the comparison 
$(a_1-3a_2-3a_3+9a_4)$ is fully efficient for the very purpose for which 
it was used in Section 46, viz. the detection of departures from 
independence; but it is inefficient for estimating any linkage 
value once linkage has been demonstrated. For close linkage in 
either coupling or repulsion, when P approaches 1 and 0 respectively, 
the method is absolutely useless as its efficiency is nearly 0. 
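
The dependence of the efficiency on P can be tabulated from the formula for E without reference to any data. The function name and the particular values of P in the sketch are assumptions chosen for illustration.

```python
def efficiency(P):
    """E = 8P(1-P)(2+P) / [(1+6P-4P^2)(1+2P)] for 0 < P < 1."""
    return 8 * P * (1 - P) * (2 + P) / ((1 + 6 * P - 4 * P**2) * (1 + 2 * P))

# A few points on the curve: near repulsion, independence, and coupling.
curve = {P: efficiency(P) for P in (0.01, 0.10, 0.25, 0.50, 0.80, 0.99)}

# Largest value of E over a fine grid of P values.
grid_max = max(efficiency(k / 1000) for k in range(1, 1000))
```

The curve rises to 1 at P = 0.25 and falls away on both sides, dropping towards 0 as P approaches either limit.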

These conclusions about the estimation of the linkage value 
can be reached without any reference to observed data. They 
rest solely on a consideration of the properties of $I_P$ and $I_{P_1}$. 
It will thus be seen that information is a concept of great value 
in the planning of analyses and experiments. Another example 
will be given later and the subject has been further discussed by 
both Fisher and Mather. 

Not only are inefficient statistics wasteful of data but they 
may also be very misleading. Once found, P and $P_1$ can be 
used to formulate new expectations for the frequencies of the 
four phenotypic classes into which the plants fall. In this way, 
it is possible to test further the hypothesis that the genes F,f and 
Ch,ch are linked. The comparison of observation with these new 
expectations is made by calculating $\chi^2$ for 2 degrees of freedom, 
the third degree of freedom having been sacrificed in calculating 
P or $P_1$, as the case may be. These new expectations are compared 
with observation in Table 54. 

The maximum likelihood statistic P gives a much better fit 
than does the second estimate, $P_1$. It is easy to see that in an 
extreme case the use of an inefficient statistic could give a spurious 
disagreement with the data, so leading to the unjustified suspicion 
that linkage was not in itself sufficient to explain the departure 
from simple mendelian expectation. Inefficient estimates of 
parameters can result in seriously incorrect conclusions and 
should never be used. 

A closer consideration of the difference between the two $\chi^2$s 
testing goodness of fit after fitting P and $P_1$ is of interest as it 
illustrates another remarkable property of maximum likelihood. 


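It has already been shown that $I_P=-\frac{d^2L}{dP^2}$. In general 

$$\frac{dL}{dP}=S\left(a\,\frac{d\log m}{dP}\right).$$

When the maximum likelihood estimate of 
P is substituted this expression has the value 0. But if some 
other estimate, $P_1$, is used, $\frac{dL}{dP}$ is no longer 0. It has 
a value which may be called D. Then 

$$D=S\left(a\,\frac{d\log m}{dP}\right).$$

Now the rate of change of the maximum likelihood expression is 

$$\frac{dD}{dP}=\frac{d}{dP}S\left(a\,\frac{d\log m}{dP}\right)=S\left(a\,\frac{d^2\log m}{dP^2}\right)=-I_P$$

a being used in place of the corresponding expectation mn. 

Now 

$$V_P=\frac{1}{I_P}$$

and so 

$$V_D=V_P\left(\frac{dD}{dP}\right)^2=\frac{1}{I_P}\,I_P^2=I_P.$$

This variance is fixed by hypothesis, hence 

$$\chi^2=\frac{D^2}{V_D}=\frac{D^2}{I_P}\quad\text{for 1 degree of freedom.}$$

Applying this to de Winton and Haldane's data 

$$D=\frac{2{,}972}{2+P_1}-\frac{361}{1-P_1}+\frac{831}{P_1}=\frac{2{,}972}{2.81244}-\frac{361}{0.18756}+\frac{831}{0.81244}=154.86145$$

and $I_P$ has already been found to be 13,509.8703. Hence 

$$\chi^2_{[1]}=\frac{D^2}{I_P}=\frac{(154.86145)^2}{13{,}509.8703}=1.7752$$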
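
The arithmetic of this $\chi^2$ can be repeated in a few lines. The sketch takes $P_1$ to five decimal places and the value of $I_P$ as quoted in the text, so the last figures may differ slightly from those printed; the variable names are assumptions.

```python
# Inefficient estimate and the information available about P.
P1 = 0.81244
I_P = 13509.8703

# Value of the maximum likelihood expression at P1 (it would be 0 at
# the maximum likelihood estimate itself).
D = 2972 / (2 + P1) - 361 / (1 - P1) + 831 / P1

# Chi-square, for 1 degree of freedom, measuring the departure of P1
# from the maximum likelihood estimate.
chi2 = D**2 / I_P
```

The quantity `chi2` is the amount by which the goodness-of-fit $\chi^2$ is expected to be inflated when $P_1$ is used in place of the maximum likelihood estimate.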


Thus we can expect $\chi^2$ testing the fit given by observation 
to the expectation found from $P_1$ to be 1.7752 greater than 
that obtained when the maximum likelihood expectations 
are used. The corresponding difference calculated in Table 54 
is 4.0790-2.2516=1.8276. This slight discrepancy is not due to 
faulty calculation or reasoning but to the fact that, as the values 
of P and $P_1$ are different, the $\chi^2$ testing the two gene ratios, 
without reference to the linkage item, are slightly different. 
The segregations of the two genes are correlated when the genes 
are linked. In the extreme case of complete linkage any departure 
from expected ratio of one gene involves a corresponding departure 
in the other. As the linkage becomes looser this correlated 
departure becomes less, and the $\chi^2$ testing the discrepancies in 
gene ratios alone will become correspondingly larger. Now $P_1$ 
is less than P and so $p_1$ is greater than p. Hence the gene 
ratio $\chi^2$ is slightly larger when the $P_1$ expectations are used. 
This coupled with the $\chi^2$ testing the discrepancy of the inefficient 
and the maximum likelihood linkage values accounts for the 
1.8276 difference of Table 54. Here the $\chi^2$ difference due to 
gene ratio correlation is only 0.0523. The great discrepancy lies 
in the item of 1.7752 traceable to the use of an inefficient estimate 
of the linkage parameter. In this way we see that any departure 
from the maximum likelihood estimate is liable to cause a seriously 
large inflation of the $\chi^2$ subsequently used to test goodness of fit. 
This means, in effect, that 1 degree of freedom in $\chi^2$ is supposedly 
sacrificed to the estimation of P, and yet, when an inefficient 
[...] subsequent test of significance cannot be trustworthy. 

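55. SIMULTANEOUS ESTIMATION 

In the examples worked above the problem involved the 
estimation of a single parameter. Sometimes, however, two or 
more parameters must be found in order to specify the hypothesis. 
Thus the formula of the normal distribution contains two unknowns, 
μ and σ. The estimation of these two parameters will serve to 
illustrate simultaneous estimation by the method of 
maximum likelihood. 

Example 32. The formula of the normal distribution is 

$$\frac{1}{\sigma\sqrt{2\pi}}\,e^{-\frac{(\mu-x)^2}{2\sigma^2}}$$

this being the frequency of the class characterized by 
x as its diagnostic measurement. Let $m_1$ be the 
expectation of the class in which $x=x_1$, and so on. Further, let $a_1$ be 
the frequency observed in the class in which $x=x_1$, and 
so on. 

Then the log likelihood expression is 

$$L=a_1\log m_1+a_2\log m_2+\ldots$$

This may be re-written as 

$$L=a_1\left[-\log(\sigma\sqrt{2\pi})-\frac{(\mu-x_1)^2}{2\sigma^2}\right]+a_2\left[-\log(\sigma\sqrt{2\pi})-\frac{(\mu-x_2)^2}{2\sigma^2}\right]+\ldots$$

$$=-n\log(\sigma\sqrt{2\pi})-\frac{a_1(\mu-x_1)^2}{2\sigma^2}-\frac{a_2(\mu-x_2)^2}{2\sigma^2}-\ldots$$

To find the equations of estimation of μ and σ the log likelihood 
must be maximized by partial differentiation with respect to 
each of these parameters, giving 

$$\frac{\partial L}{\partial\mu}=-\frac{2a_1(\mu-x_1)}{2\sigma^2}-\frac{2a_2(\mu-x_2)}{2\sigma^2}-\ldots=0\qquad\text{(i)}$$

$$\frac{\partial L}{\partial\sigma}=-\frac{n\sqrt{2\pi}}{\sigma\sqrt{2\pi}}+\frac{a_1(\mu-x_1)^2}{\sigma^3}+\frac{a_2(\mu-x_2)^2}{\sigma^3}+\ldots=0\qquad\text{(ii)}$$

Equation (i) reduces to $a_1(\mu-x_1)+a_2(\mu-x_2)+\ldots=0$ or 

$$S(a_1x_1)-\mu S(a_1)=0$$

and since $S(a_1)=n$, $\mu=\frac{S(a_1x_1)}{n}$, i.e. $\bar{x}$, as the numerator is the sum 
of all the x values in the observations. 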

Equation (ii) may be simplified to 

$$-n\sigma^2+a_1(\mu-x_1)^2+a_2(\mu-x_2)^2+\ldots=0$$

Then substituting $\bar{x}$ for μ it becomes 

$$\sigma^2=\frac{S[a(x-\bar{x})^2]}{n}$$

The numerator of this expression is the sum of squares of deviations 
from the mean of x, just as in the formula used in Section 10. 
The denominator, however, requires a word of explanation. It 
is n and not n-1 as used before. The reasons for the use of n-1, 
the number of degrees of freedom, have been fully set out earlier. 
It will be remembered that if a theoretical mean is used, then 
n is the proper divisor, but that if μ is estimated from the data 
the number of degrees of freedom is reduced to n-1. The substitution 
of $\bar{x}$ for μ in the above calculation, $\bar{x}$ having already 
been found as the estimate of μ, requires that n-1 be substituted 
for n in the denominator, and so the formula reduces to the form 
used earlier, viz. 

$$s^2=\frac{S[a(x-\bar{x})^2]}{n-1}$$

The necessity for the change in the denominator at the same 
time as the substitution in the numerator is a consequence of 
the use of small samples. In large samples the difference between 
n and n-1 is so small as to be negligible. 
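
The two divisors can be compared on a small set of measurements. The five values below are hypothetical and serve only to show the calculation; the variable names are likewise assumptions.

```python
# Hypothetical measurements of x, each observed once.
x = [4.1, 5.0, 5.6, 6.3, 7.0]
n = len(x)

# Estimate of mu: the mean of x.
mean = sum(x) / n

# Sum of squares of deviations from the mean.
sum_squares = sum((xi - mean) ** 2 for xi in x)

# Divisor n: the form given directly by maximizing the likelihood.
var_divisor_n = sum_squares / n
# Divisor n - 1: the form used earlier, allowing for the estimation of mu.
var_divisor_n_minus_1 = sum_squares / (n - 1)
```

With only five observations the two values differ by a factor of 5/4; with several hundred observations the difference would be negligible.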

86. COMBINED ESTIMATION AND HETEROGENEITY TESTS 

The method of maximum likelihood offers a solution to the 
problems that arise when more than one type of data each 
supplies information about the same parameter. Two main 
questions are to be answered in such cases, that of finding the 
best estimate of the parameter when using all the sets of data 
together, and that of testing the agreement of all the data with 
each other. A detailed discussion of these problems has been 
given by Mather, but the salient features of the treatment can 

be illustrated by a very simple example. 

Example 33. Suppose we have an organism heterozygous for
a single gene difference, A,a. The segregation of this gene may
be investigated in either or both of two ways, viz. by self-fertilizing
or intercrossing such heterozygotes to give an $F_2$ and
by backcrossing the heterozygote to a recessive individual.

Let $x$ be the proportion of a gametes, and $1-x$ correspondingly
the proportion of A gametes, which are successfully represented
in the next generation, $x$ being alike for both male and female
gametes. It is expected that the two phenotypic classes,
A and a, will occur with the frequencies $1-x : x$ in the backcross
and $1-x^2 : x^2$ in the $F_2$. The observed frequencies of individuals
of the two phenotypes in backcross and $F_2$ may be denoted as:

Type of progeny        A            a            Total
Backcross           $a_{B1}$     $a_{B2}$      $n_B$
$F_2$               $a_{F1}$     $a_{F2}$      $n_F$

Three questions may be asked, viz.

(a) What value of $x$ best fits the joint data?

(b) Does this agree with the simple mendelian expectation
of $x = \tfrac{1}{2}$?

(c) Do the two families agree in showing the same value of $x$?

The likelihood of obtaining the observed backcross family is

$$\frac{n_B!}{a_{B1}!\,a_{B2}!}(1-x)^{a_{B1}}\,x^{a_{B2}}$$

and that of obtaining the $F_2$ is

$$\frac{n_F!}{a_{F1}!\,a_{F2}!}(1-x^2)^{a_{F1}}\,(x^2)^{a_{F2}}$$

Since the two families are quite independent of each other the
likelihood of their simultaneous occurrence is the product of the
two individual likelihoods, and the joint log likelihood becomes

$$L = C + a_{B1}\log(1-x) + a_{B2}\log(x) + a_{F1}\log(1-x^2) + a_{F2}\log(x^2)$$

The best-fitting joint estimate of $x$ is then obtained by
maximization.

$$\frac{dL}{dx} = -\frac{a_{B1}}{1-x} + \frac{a_{B2}}{x} - \frac{2x\,a_{F1}}{1-x^2} + \frac{2x\,a_{F2}}{x^2} = 0$$

As a numerical example the data of Fisher and Mather on 
segregation for D,d in mice may be used (Table 55). 

TABLE 55

Segregation for Intense (D) and Dilute (d) in Mice (Fisher and Mather)

                    Phenotype
Family            D         d        Total      $\chi^2$
Backcross        648       571       1,219      4.8638
$F_2$            132        56         188      2.2979



Substituting these figures in the maximum likelihood equation

$$-\frac{648}{1-x} + \frac{571}{x} - \frac{2x \times 132}{1-x^2} + \frac{2x \times 56}{x^2} = 0$$

which simplifies to $683 - 648x - 1{,}595x^2 = 0$
and $x = 0.482049$
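The reduction to a quadratic can be checked numerically. The sketch below assumes the general form obtained by multiplying the maximum likelihood equation through by $x(1-x^2)$; the function and argument names are illustrative and are not taken from the text.

```python
import math


def joint_estimate(bc_dominant, bc_recessive, f2_dominant, f2_recessive):
    """Solve c - b*x - a*x**2 = 0, the maximum likelihood equation
    multiplied through by x(1 - x**2)."""
    c = bc_recessive + 2 * f2_recessive
    b = bc_dominant
    a = bc_dominant + bc_recessive + 2 * f2_dominant + 2 * f2_recessive
    # Positive root of a*x**2 + b*x - c = 0
    return (-b + math.sqrt(b * b + 4 * a * c)) / (2 * a)


# Data of Table 55: backcross 648 D, 571 d; F2 132 D, 56 d
x_hat = joint_estimate(648, 571, 132, 56)
print(round(x_hat, 6))
```

For the mouse data the coefficients are 683, 648 and 1,595, as in the equation above.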

The amount of information about $x$ may be found from
either of the two formulae

$$I = nS\left[\frac{1}{m}\left(\frac{dm}{dx}\right)^2\right] \quad\text{or}\quad I = -S\left[a\,\frac{d^2\log m}{dx^2}\right]$$

If the second formula is used, it is clear that the second derivative
of the joint log likelihood expression is the sum of the second
derivatives of the two log likelihoods given by backcross and
$F_2$ separately. Hence the total amount of information about
$x$ is the sum of the amounts yielded by the two families
when taken separately. Once this has been established it is,
however, easier to find the information value by using the first
formula.


Backcross

Class      m        dm/dx     (1/m)(dm/dx)^2
A          1-x      -1        1/(1-x)
a          x        1         1/x
Total      1        0         i = 1/[x(1-x)]

$F_2$

Class      m        dm/dx     (1/m)(dm/dx)^2
A          1-x^2    -2x       4x^2/(1-x^2)
a          x^2      2x        4x^2/x^2
Total      1        0         i = 4/(1-x^2)


The total amount of information is thus

$$I = \frac{n_B}{x(1-x)} + \frac{4n_F}{1-x^2} = \frac{1{,}219}{0.249678} + \frac{4 \times 188}{0.767629} = 4{,}882.29 + 979.64 = 5{,}861.93$$

and

$$V_x = \frac{1}{I} = 0.0001706$$

This method of combining different groups of data to give
one estimate of the parameter is worth examining more
closely. Let $\hat{x}$ be the best joint estimate of $x$, as found above,
and $x_B$ and $x_F$ be the best estimates supplied by the two sets
of data taken separately. It has already been shown (Section 53)
that the rate of change of the maximum likelihood expression
on the parameter is the amount of information yielded by the
data. Hence $I_B(\hat{x} - x_B) = -D_B$, provided that $(\hat{x} - x_B)$ is small, where
$D_B$ is the value of the backcross maximum likelihood expression
when $\hat{x}$ is substituted for $x$. Similarly $I_F(\hat{x} - x_F) = -D_F$. Now the
joint maximum likelihood expression has the value 0 when $\hat{x}$ is
used, i.e. $D_B + D_F = 0$. So

$$I_B(\hat{x} - x_B) = -I_F(\hat{x} - x_F)$$

and

$$\hat{x} = \frac{I_B x_B + I_F x_F}{I_B + I_F}$$

The best joint value is the mean of the separate values weighted
according to their respective amounts of information. In general
$I = 1/V$, and so the reciprocal of the variance is the weight to be
applied to any statistic when it is being combined with other
estimates of the same parameter. The procedure used in combining
different estimates of a correlation coefficient in Section 41
is of general validity.
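As an illustration, the weighting can be tried on the mouse data of Table 55. The sketch below makes one assumption that should be noted: both amounts of information are evaluated at the joint estimate 0.482049, so the weighted mean is only an approximation to that estimate, holding in so far as the separate estimates lie close to it. The variable names are illustrative.

```python
import math

n_backcross, n_f2 = 1219, 188
x_joint = 0.482049                    # joint estimate found above

# Separate estimates: x = d/n in the backcross, x**2 = d/n in the F2
x_backcross = 571 / n_backcross
x_f2 = math.sqrt(56 / n_f2)

# Amounts of information, evaluated at the joint estimate (an assumption)
info_backcross = n_backcross / (x_joint * (1 - x_joint))
info_f2 = 4 * n_f2 / (1 - x_joint ** 2)

weighted_mean = (info_backcross * x_backcross + info_f2 * x_f2) / (
    info_backcross + info_f2
)
print(round(info_backcross, 2), round(info_f2, 2), round(weighted_mean, 4))
```

The weighted mean falls within about 0.001 of the maximum likelihood value; the small discrepancy reflects the proviso that the differences between the estimates be small.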

We now turn to the problem of testing the joint agreement
of the backcross and $F_2$ data with the simple mendelian expectation
of $x = \tfrac{1}{2}$. If $\tfrac{1}{2}$ is substituted for $x$ in the joint maximum
likelihood expression the latter takes a value $D$. We can also
calculate the amount of information, $I$, yielded by the joint
data about $x$ when the latter has this value of $\tfrac{1}{2}$. As $D^2/I$ is
a $\chi^2$ for 1 degree of freedom (Section 54) the appropriate test
of significance is then easily made.

The joint maximum likelihood expression is

$$\frac{dL}{dx} = \frac{571}{x} - \frac{648}{1-x} + \frac{2 \times 56}{x} - \frac{2x \times 132}{1-x^2}$$

which, when $x = \tfrac{1}{2}$, becomes

$$(2 \times 571) - (2 \times 648) + (4 \times 56) - (\tfrac{4}{3} \times 132) = -106 = D$$

and

$$I = \frac{n_B}{x(1-x)} + \frac{4n_F}{1-x^2} = (4 \times 1{,}219) + (4 \times \tfrac{4}{3} \times 188) = 5{,}878.67$$

Hence $\chi^2 = \dfrac{106^2}{5{,}878.67} = 1.9113$ and the joint data agree with
mendelian expectation.
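The deviation test can be reproduced in a few lines. This is a sketch using the counts of Table 55; the function names are illustrative.

```python
def joint_score(x):
    """Joint maximum likelihood expression dL/dx for the mouse data."""
    return 571 / x - 648 / (1 - x) + 2 * 56 / x - 2 * x * 132 / (1 - x ** 2)


def joint_information(x):
    """Backcross plus F2 information about x."""
    return 1219 / (x * (1 - x)) + 4 * 188 / (1 - x ** 2)


deviation = joint_score(0.5)
information = joint_information(0.5)
chi_square = deviation ** 2 / information   # 1 degree of freedom
print(deviation, round(information, 2), round(chi_square, 4))
```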

The last column of Table 55 gives the values of the two $\chi^2$'s
testing the agreement of the separate families with this expectation
of $x = \tfrac{1}{2}$. In the former case the segregation expected is 1 : 1 and
in the latter case 3 : 1. The total $\chi^2$ is 7.1617, which may be
analysed into deviation and heterogeneity items as shown in
Table 56. The deviation item has been found above and the
heterogeneity item is found by subtraction.

TABLE 56

The Analysis of Fisher and Mather's Data

Item               $\chi^2$      N       P
Deviation          1.9113       1       0.20-0.10
Heterogeneity      5.2504       1       0.05-0.02
Total              7.1617       2

The heterogeneity $\chi^2$ is suspiciously large and so there is doubt
whether the two sets of data agree. The
heterogeneity is not sufficiently pronounced to cause any serious
disturbance, but it is doubtful whether the two sets of data can
properly be combined in the estimation of $x$.

This analysis gives the heterogeneity item appropriate to
$x = \tfrac{1}{2}$ and, if the deviation were significant, this item would be
inexact. Though the calculation of a more exact heterogeneity
item is not necessary with the present data, it may be undertaken
as an illustration of the method for use in other cases.

The best-fitting joint estimate of $x$, viz. $\hat{x}$, is substituted in
the two separate maximum likelihood expressions, from backcross
and $F_2$, which then take the values of $D_B$ and $D_F$. The two
separate amounts of information, $I_B$ and $I_F$, are calculated and

$$\chi^2 = \frac{D_B^2}{I_B} + \frac{D_F^2}{I_F}$$

It should be noted that this $\chi^2$ would have 2 degrees of freedom
if an expected value of $x$ were used, but where the estimate
$\hat{x}$ has been found from the data themselves 1 degree of freedom
is, as usual, lost.

$$\hat{x} = 0.482049$$

$$D_B = \frac{571}{0.482049} - \frac{648}{0.517951} = -66.55$$

$$D_F = \frac{56 \times 2}{0.482049} - \frac{132 \times 2 \times 0.482049}{0.767629} = +66.55$$

$$I_B = \frac{1{,}219}{0.482049 \times 0.517951} = 4{,}882.29$$

$$I_F = \frac{4 \times 188}{0.767629} = 979.64$$

$$\chi^2 = \frac{D_B^2}{I_B} + \frac{D_F^2}{I_F} = 5.4294 \text{ with a probability of } 0.02.$$
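The more exact heterogeneity item can be recomputed as follows. This is a sketch with illustrative names; it takes the joint estimate 0.482049 as given rather than re-deriving it, and agreement with the printed figure is to be expected only to about three decimal places because of rounding.

```python
x_hat = 0.482049   # joint maximum likelihood estimate, taken as given

# Separate maximum likelihood expressions evaluated at the joint estimate
d_backcross = 571 / x_hat - 648 / (1 - x_hat)
d_f2 = 2 * 56 / x_hat - 2 * x_hat * 132 / (1 - x_hat ** 2)

# Separate amounts of information at the same value
i_backcross = 1219 / (x_hat * (1 - x_hat))
i_f2 = 4 * 188 / (1 - x_hat ** 2)

heterogeneity = d_backcross ** 2 / i_backcross + d_f2 ** 2 / i_f2
print(round(d_backcross, 2), round(d_f2, 2), round(heterogeneity, 3))
```

The two values of D are equal and opposite, as they must be when the joint expression is zero.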


57. PLANNING EXPERIMENTS

The concept of the amount of information is very helpful in 
the understanding and planning of experiments. It is very 
commonly the case that some aspect of experimental technique 
is capable of being modified in such a way that the resulting data 
are to some extent altered. A comparison of the amounts of 
information about the parameter in question will quickly show 
which of the various procedures is most profitable. 

As a case in point we may consider the biological assay of 
drugs. The dose administered may be chosen so that any 
proportion of the test individuals show the reaction, death, 
convulsion, &c., characterizing the drug’s effect. Fisher has, 
however, shown that the most precise assay is made when 50% 
of the individuals react, as it is here that the quantity of informa¬ 
tion about drug strength is greatest. So, in making such assay, 
the drug should always be administered at approximately the 
strength necessary to give 50% effect. 

Though perhaps of rather specialized interest, the estimation 
of genetical recombination values provides a very striking example 
of the use of information in the design of experiments. It has 
already been stated that an individual heterozygous for two genes, 
i.e. of the genotype AaBb, gives four classes of gamete, AB, Ab, 
aB and ab, with the characteristic frequencies

$$\tfrac{1}{2}(1-p) : \tfrac{1}{2}p : \tfrac{1}{2}p : \tfrac{1}{2}(1-p) \quad\text{or}\quad \tfrac{1}{2}p : \tfrac{1}{2}(1-p) : \tfrac{1}{2}(1-p) : \tfrac{1}{2}p,$$

$p$ being the recombination value, according to whether the
individual itself arose from the fusion of AB and ab or of Ab
and aB gametes. The former type of heterozygote, symbolized
as AB/ab, is usually said to show the coupling phase of linkage,
the latter, Ab/aB, the repulsion phase. When the genes are
unlinked $p = 1-p = \tfrac{1}{2}$ and the two phases cease to differ.

Suppose that it is desired to estimate the recombination 
value. It is necessary to devise some technique for determining 
the relative frequencies of the non-recombinant and recombinant 
gametic classes, i.e. AB and ab as opposed to Ab and aB in the 
coupling case and vice versa for repulsion. Several experimental 
courses are open for this purpose, but only two of them are in 
common use. First of all, the double heterozygote could be 
crossed to a double recessive aabb. Four classes of progeny, all 
phenotypically distinguishable, are produced, viz. AaBb, Aabb 
aaBb and aabb. These classes have the same expected fre¬ 
quencies as the four corresponding types of gamete given by the 
double heterozygote. Such a backcross would normally be made 
in animals, provided, of course, that aabb individuals were
available. One type of mating is as easy as another in bisexual
organisms. But in plants self-pollination may be much easier
than backcrossing and so would commend itself for practical
reasons. On selfing, ten genotypes are produced in the $F_2$.
Table 57 gives the frequencies of these classes for the coupling
phase, assuming that $p$ is the same in male and female gametes.


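TABLE 57

The Progeny of a Selfed Double Heterozygote
(each frequency to be multiplied by 1/4)

          AA           Aa                          :   aa
BB        (1-p)^2      2p(1-p)                     :   p^2
Bb        2p(1-p)      R 2p^2,  C 2(1-p)^2         :   2p(1-p)
..................................................................
bb        p^2          2p(1-p)                     :   (1-p)^2

R = repulsion phase    C = coupling phase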


The repulsion phase gives similar results but with $p$ and $1-p$
interchanged. In the common case of dominance of both genes,
only four classes are distinguishable on the basis of phenotypic
differences. These classes are indicated by dotted lines in the
table. Simple addition then shows that the frequencies of the
four classes are

              AB                 Ab                 aB                 ab
Coupling      [2+(1-p)^2]/4      [1-(1-p)^2]/4      [1-(1-p)^2]/4      (1-p)^2/4
Repulsion     (2+p^2)/4          (1-p^2)/4          (1-p^2)/4          p^2/4


Which of the two methods, backcrossing or self-pollinating,
is to be recommended for determining the recombination
value?

The first step in answering this question is that of finding the
relative amounts of information about $p$ yielded by the two types
of progeny. These are easily obtained from the formula

$$i = S\left[\frac{1}{m}\left(\frac{dm}{dp}\right)^2\right]$$

as shown in Table 58. It will be noticed that $i$,
rather than $I$, is used for this purpose, so obviating any confusion
due to the question of family size. The result would, however,
be the same if $I$ were employed, provided that backcross and
$F_2$ were of the same size.

The information from both types of family varies with $p$,
so it is necessary to take one as a standard for comparison with
the other. The backcross information is the same in both
coupling and repulsion and hence is to be preferred as the standard.
If the value of the backcross is taken as unity, the relative value
of the $F_2$, obtained by dividing the $F_2$ amount of information
by the backcross amount, is

$$\text{Coupling}\quad \frac{2(1-p)(3-4p+2p^2)}{(2-p)(3-2p+p^2)} \qquad\qquad \text{Repulsion}\quad \frac{2p(1+2p^2)}{(2+p^2)(1+p)}$$

TABLE 58

Amounts of Information about the Recombination Value

Backcross

            Coupling                                 Repulsion
Class       m          dm/dp    (1/m)(dm/dp)^2       m          dm/dp    (1/m)(dm/dp)^2
AB          (1-p)/2    -1/2     1/[2(1-p)]           p/2        1/2      1/(2p)
Ab          p/2        1/2      1/(2p)               (1-p)/2    -1/2     1/[2(1-p)]
aB          p/2        1/2      1/(2p)               (1-p)/2    -1/2     1/[2(1-p)]
ab          (1-p)/2    -1/2     1/[2(1-p)]           p/2        1/2      1/(2p)
Total       1          0        1/[p(1-p)]           1          0        1/[p(1-p)]

$F_2$

            Coupling                                              Repulsion
Class       m                dm/dp       (1/m)(dm/dp)^2           m            dm/dp    (1/m)(dm/dp)^2
AB          (3-2p+p^2)/4     -(1-p)/2    (1-p)^2/(3-2p+p^2)       (2+p^2)/4    p/2      p^2/(2+p^2)
Ab          (2p-p^2)/4       (1-p)/2     (1-p)^2/(2p-p^2)         (1-p^2)/4    -p/2     p^2/(1-p^2)
aB          (2p-p^2)/4       (1-p)/2     (1-p)^2/(2p-p^2)         (1-p^2)/4    -p/2     p^2/(1-p^2)
ab          (1-2p+p^2)/4     -(1-p)/2    1                        p^2/4        p/2      1

Total (coupling):    2(3-4p+2p^2) / [p(2-p)(3-2p+p^2)]
Total (repulsion):   2(1+2p^2) / [(2+p^2)(1-p^2)]

These two fractions are perhaps most easily compared graphically,
the relative value of the $F_2$ being plotted against $p$.
The $F_2$ is as valuable, plant for plant, as the backcross in close
coupling, but its value falls away to 0 in close repulsion. When
$p = \tfrac{1}{2}$ the $F_2$ gives four-ninths of the precision of the backcross.
So when backcrossing is as simple an operation as self-pollinating,
the $F_2$ is only recommendable in case of close linkage in the coupling
phase. In close repulsion an $F_2$ is relatively worthless, and
even for the detection of loose linkage it has less than half the
value of the backcross.

[Figure: the relative values of $F_2$ and backcross progenies, for the
estimation of the recombination value. Horizontal axis: recombination
fraction ($p$), for repulsion and for coupling.]

If, however, $F_2$ seed is more easily produced than backcross
material, the advantage in precision of the latter may be offset
by the possibility of raising larger progenies of the former type.
The experimenter, knowing his material, can form an opinion on the
point at which the ease of production of an $F_2$ is overborne by
its poorer precision. If, for example, a given amount of labour
resulted in twice as big an $F_2$ as backcross, it would be
profitable to test almost any linkage in the coupling phase by
means of $F_2$'s, always provided, of course, that no information
was sought on the equality of male and female recombination
values. Repulsion $F_2$'s would still be unprofitable.
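The relative values can be checked by computing the amounts of information directly from the class frequencies and comparing them with the two closed fractions. The sketch below uses illustrative function names; the substitution u = (1-p)² for coupling and u = p² for repulsion is a convenience introduced here, not notation from the text.

```python
def information(classes):
    """Sum of (1/m)(dm/dp)^2 over classes given as (m, dm/dp) pairs."""
    return sum(slope * slope / m for m, slope in classes)


def backcross_information(p):
    return information(
        [((1 - p) / 2, -0.5), (p / 2, 0.5), (p / 2, 0.5), ((1 - p) / 2, -0.5)]
    )


def f2_information(p, phase):
    # With u = (1-p)^2 (coupling) or p^2 (repulsion) the four phenotypic
    # frequencies are (2+u)/4, (1-u)/4, (1-u)/4 and u/4.
    if phase == "coupling":
        u, du = (1 - p) ** 2, -2 * (1 - p)
    else:
        u, du = p ** 2, 2 * p
    return information(
        [((2 + u) / 4, du / 4), ((1 - u) / 4, -du / 4),
         ((1 - u) / 4, -du / 4), (u / 4, du / 4)]
    )


def relative_value(p, phase):
    return f2_information(p, phase) / backcross_information(p)


def closed_form(p, phase):
    if phase == "coupling":
        return 2 * (1 - p) * (3 - 4 * p + 2 * p * p) / ((2 - p) * (3 - 2 * p + p * p))
    return 2 * p * (1 + 2 * p * p) / ((2 + p * p) * (1 + p))


for p in (0.05, 0.25, 0.5):
    print(p, round(relative_value(p, "coupling"), 4),
          round(relative_value(p, "repulsion"), 4))
```

At p = 0.5 both phases give 4/9, and for small p the coupling value approaches 1 while the repulsion value approaches 0.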

Before leaving this example a further point may be made.
An $F_2$ classified fully into all 10 genotypes, as given in Table 57,
can, by the calculation of $i$, be shown to contain twice the
information of a backcross. In other words, the incompletely
classified $F_2$ never yields more than half the amount actually
present in the family, and usually gives a much smaller fraction
still. Incomplete separation of the classes means loss of information.
So it follows that classification should always be as complete
as is practicable. In the present case, however, completing
the classification is very time and labour consuming and cannot
be undertaken with profit except in very special circumstances.


58. FIDUCIAL PROBABILITY

A parameter is estimated for the purpose of completing some 
hypothesis related to the phenomenon under consideration. 
The hypothesis that two genes are linked is, for example, of very 
restricted use until the recombination value has been found. 
Once the parameter has been estimated, on the other hand, the 
hypothesis of linkage may be tested for agreement with any set 

of data concerning the genes in question. 

We have seen that inefficient estimation leads to trouble
when further tests of significance are based on the resulting
statistic. But even when an efficient statistic has been found
it must be used in a proper manner, or false conclusions will be
drawn.

The point is well illustrated by one of the commonest operations
in the realm of statistical analysis, viz. the testing of a
deviation when the variance with which it is to be compared is
estimated from the data. The formula

$$V_x = \frac{S(x - \bar{x})^2}{n-1}$$

supplies not merely an efficient but a sufficient estimate of $\sigma_x^2$,
the true variance of the distribution of $x$. To test the significance
of any observed deviation of $\bar{x}$ from its expected value $\mu$ it is
necessary to compute the ratio $\dfrac{\bar{x} - \mu}{s_{\bar{x}}}$, where $s_{\bar{x}}$, found as $\sqrt{V_{\bar{x}}}$, is
the sufficient estimate of $\sigma_{\bar{x}}$. The whole of the care devoted to
the calculation of $s_{\bar{x}}$ is wasted if it is then forgotten that the
precision of $s_{\bar{x}}$ is dependent on the number of degrees of freedom
available for its estimation. Fisher has shown that this precision

is, in fact, ~ of where n is the number of observations. 

n+3 


To treat this ratio as a normal deviate, i.e. to treat $s_{\bar{x}}$ as $\sigma_{\bar{x}}$ itself,
rather than as an estimate of $\sigma_{\bar{x}}$, is to assign to the calculation
a spuriously high precision. There is then a real danger
of overestimating the significance of the deviation under test.
But to treat the fraction as a $t$ for the appropriate number of
degrees of freedom is to give it its proper weight. Its true
significance will then be apparent. This quantity $t$ is derived
solely from the data at hand and it carries no unproved, and
unprovable, implications about the hypothetical nature of the
population from which the sample is supposedly drawn. It is
known exactly and its distribution is also known exactly. It
permits the experimenter to draw rigorous conclusions.

The two recent great developments in statistics are (i) the
calculation of the distributions of the various exact tests of
significance, as described in Chapter IV, and (ii) the demonstration
that data contain a definite amount of information about any
parameter upon which they depend, and that this information
can always be extracted by suitable methods of estimation.
Statistical methods developed from these two findings lead to
conclusions which are based solely on the data and which are as
precise as the data will allow. In this way proper statistical
analysis leads to rigorous inferences. These are of necessity
stated in terms of uncertainty, but the degree of uncertainty is
known and can be given.

The ratio $t$, for example, usually takes the form

$$t = \frac{\bar{x} - \mu}{s_{\bar{x}}}$$

Of the four quantities involved, three, $t$, $\bar{x}$ and $s_{\bar{x}}$, are known,
so that the expression may be used for a purpose other than
that of testing the deviation of $\bar{x}$ from some hypothetical value $\mu$.
It may equally well be written as

$$\mu = \bar{x} - t\,s_{\bar{x}}$$

When given probability levels are assigned to $t$, this form of the
equation allows us to state rigorously, on the basis of the available
observations and with no resort to hypothesis, that $\mu$ lies between
the levels $\bar{x} + t_p s_{\bar{x}}$ and $\bar{x} - t_p s_{\bar{x}}$ with a probability of $1-p$, where $t_p$ is
the value of $t$ at that level of probability.

This approach is of general validity. In the example of
the uptake of Rb ions by potatoes,
$b = 0.1730$, $s_b = 0.007186$ and $t$ for 3 degrees of freedom has
the value 3.182 at the 0.05 level of probability. Then $\beta$, the
parameter of which $b$ is an estimate, lies between the values
$0.1730 + (3.182 \times 0.007186)$, i.e. 0.1959 and $0.1730 - (3.182 \times 0.007186)$,
i.e. 0.1501 with a 5% probability of error. Such a statement is
one of Fiducial Probability. It is a property of those data which
are its origin and it is rigorous because sound statistical methods
have been used in the analysis of the data. Only by the use of
such analytical methods can the experimenter avoid the twin
dangers of over-assessment and under-assessment of the meaning
of his results.
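The arithmetic of the fiducial limits is easily reproduced. In this minimal sketch the function name is illustrative, and the value of t is taken from the table, as in the text, rather than computed.

```python
def fiducial_limits(estimate, standard_error, t_value):
    """Limits estimate -/+ t * s for the tabulated t at the chosen level."""
    half_width = t_value * standard_error
    return estimate - half_width, estimate + half_width


# Rb uptake example: b = 0.1730, s_b = 0.007186, t = 3.182 (3 d.f., P = 0.05)
lower, upper = fiducial_limits(0.1730, 0.007186, 3.182)
print(round(lower, 4), round(upper, 4))
```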


REFERENCES

CATCHESIDE, D. G. 1937. Secondary pairing in Brassica oleracea. Cytologia,
Fujii Jub. Vol., 366-78.

FISHER, R. A. 1937. The Design of Experiments. Oliver and Boyd.
Edinburgh. 2nd ed.

FISHER, R. A. 1938. Statistical Theory of Estimation. Calcutta University
Readership Lectures.

FISHER, R. A., and MATHER, K. 1936. A linkage test with mice. Ann. Eugen.,
7, 265-80.

MATHER, K. 1935. The combination of data. Ann. Eugen., 6, 399-410.

MATHER, K. 1938. The Measurement of Linkage in Heredity. Methuen. London.

WINTON, D. DE, and HALDANE, J. B. S. 1935. The genetics of Primula
sinensis, III. J. Genet., 31, 67-100.



CHAPTER XIII 


SOME TRANSFORMATIONS 

59. THE ANGULAR TRANSFORMATION 

WHEN an object may fall into either of two classes, the probability
of it falling into one being $p$ and into the other $q$ $(= 1-p)$,
and $n$ such objects are observed, the probabilities of finding
all $n$ to be of the first type, $n-1$ of the first and 1 of the second
types, and so on, are given, as we have seen in Section 5, by the
expansion of the binomial expression $(p+q)^n$. If a series of such
groups of $n$ objects are observed the mean proportion of objects
in the first class is expected to be $p$ and the variance of the
proportion to be $\dfrac{pq}{n}$ (Section 13). Thus the precision with which
we determine $p$ is a function of $p$, as well as of $n$, so that even
if $n$ is held constant, the variance will not be constant when
$p$ itself varies.

This means that we cannot combine proportions by simple 
averaging, since each proportion will be determined with 
a different precision and hence should be weighted by the inverse 
of its variance in any combining operation. Similarly we are 
debarred from the use of such a simple and powerful technique 
as the analysis of variance in dealing with data consisting of 
proportions as this involves the assumption that each datum, 
in this case each proportion, is subject to the same variance. 
Over small ranges of $p$, especially near $p = 0.5$, this assumption
may be sufficiently nearly true for the errors arising therefrom
to be negligible; but over wider ranges of $p$ the available
analytical methods must be seriously limited by the dependence
of $V_p$ on $p$.

The angular transformation, which consists of replacing
$p$ (or $q$) by an angle $\phi$ such that $p = \sin^2\phi$, is of great value in
overcoming this handicap. For

$$I_p = \frac{1}{V_p} = \frac{n}{pq}$$

Now

$$\frac{dp}{d\phi} = \frac{d\sin^2\phi}{d\phi} = 2\sin\phi\cos\phi,$$

when $\phi$ is measured in radians. Furthermore

$$q = 1-p = 1-\sin^2\phi = \cos^2\phi$$

Therefore

$$I_\phi = I_p\left(\frac{dp}{d\phi}\right)^2 = \frac{n}{\sin^2\phi\,\cos^2\phi}\,(2\sin\phi\cos\phi)^2 = 4n$$

and is independent of the value of $\phi$.

When, as is usually more convenient, $\phi$ is measured in degrees

$$\frac{dp}{d\phi} = \frac{\pi}{90}\sin\phi\cos\phi \quad\text{and}\quad I_\phi = \frac{4n\pi^2}{180^2} = \frac{n}{820.7}$$

$V_\phi$ then becomes $\dfrac{820.7}{n}$.

Hence the angular transformation replaces proportions by
angular values which can be treated by simple averaging, the
analysis of variance, and similar methods.
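The constancy of the variance can be examined numerically. The sketch below is illustrative only: it compares the theoretical value 8100/(π²n), which is the source of the figure 820.7, with the exact variance of the transformed proportion computed from the binomial distribution. The function names are assumptions introduced here, and `math.comb` requires Python 3.8 or later.

```python
import math


def angular_degrees(p):
    """Angle phi, in degrees, such that p = sin^2(phi)."""
    return math.degrees(math.asin(math.sqrt(p)))


def theoretical_variance(n):
    """Large-sample variance of phi in degrees: 8100 / (pi^2 n)."""
    return 8100.0 / (math.pi ** 2 * n)


def exact_variance(n, p):
    """Variance of phi over the binomial distribution of k successes in n."""
    probs = [math.comb(n, k) * p ** k * (1 - p) ** (n - k) for k in range(n + 1)]
    angles = [angular_degrees(k / n) for k in range(n + 1)]
    mean = sum(pr * a for pr, a in zip(probs, angles))
    return sum(pr * (a - mean) ** 2 for pr, a in zip(probs, angles))


print(round(theoretical_variance(1), 1), round(theoretical_variance(30), 2))
for p in (0.3, 0.5, 0.7):
    print(p, round(exact_variance(200, p), 3), round(theoretical_variance(200), 3))
```

For moderately large n the exact variance stays close to the theoretical value across this range of p, which is the property that makes the transformed values suitable for an analysis of variance.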

Example 34. The value of the angular transformation may
be illustrated by the analysis of an experiment, which formed
part of an investigation into the precautions necessary to ensure
effective isolation of varieties in the production of seed of crop
plants. It was desired to know the amount of inter-varietal
cross-pollination occurring between two varieties of the radish
when no spatial isolation was attempted. Fifty plants of each
of the two varieties, one white rooted and the other red rooted,
were grown in a square of side ten plants, the representatives of
the two varieties being randomly sited within the square.
Radishes set no seed with their own pollen and so all seed comes
from crossing with other plants. The chance of the pollen grain,
effective in giving any one seed, coming from a plant of the
opposite variety to the mother, is rather greater than 0.5, since
there are 50 potential fathers of the opposite kind to 49 of the
mother's own variety, and even the pollen of some of these
49 may be ruled out by the same mechanism as prevents effective
self-pollination. To avoid complications, however, in our illustration
we will assume expectations of 0.5 probability of the
effective pollen coming from the mother plant's own variety,
and 0.5 of it being from the other variety.

30 seedlings were grown from the seed of each of 20 white-rooted and
20 red-rooted mothers taken at random from the square. Inter-varietal
crosses were recognizable by the resulting seedling's
purple pigment. Table 59 shows the numbers of seedlings
resulting from inter-varietal crosses in each progeny.
Do these data accord with the expectation of 50% inter-varietal
crossings, and if not, do the two varieties agree in the percentages
they show?
TABLE 59

Frequency of Contamination by Foreign Pollen in Radishes

No. of hybrids                  $\phi$ in        Frequency observed in
(in 30 seedlings)     p         degrees       White series    Red series
       1            0.033        10.5              0              1
       3            0.100        18.4              0              1
       4            0.133        21.4              0              1
       5            0.167        24.1              1              1
       6            0.200        26.6              0              2
       7            0.233        28.9              1              1
       8            0.267        31.1              1              0
       9            0.300        33.2              1              0
      10            0.333        35.3              1              2
      11            0.367        37.3              2              0
      12            0.400        39.2              3              0
      13            0.433        41.2              1              3
      14            0.467        43.1              0              2
      15            0.500        45.0              1              2
      16            0.533        46.9              2              0
      17            0.567        48.8              0              1
      18            0.600        50.8              1              1
      19            0.633        52.7              2              0
      20            0.667        54.7              1              0
      21            0.700        56.8              1              0
      22            0.733        58.9              1              0
      23            0.767        61.1              0              1
      26            0.867        68.6              0              1
Total                                             20             20


The value of $\phi$ may be found from the proportions of hybrid
seedlings either by the use of the relation $p = \sin^2\phi$ or, more
easily, from Fisher and Yates's table of this transformation.
Here, where the observations are on 30 individuals in each case,
Fisher and Yates's Table XIII relating $\phi$ to the actual numbers
of observations in the two classes may be used, so obviating the
calculations of the proportions. The sum of $\phi$ in the white
series is 851.4, giving a mean of 42.57, the sum in the red series
is 756.2, giving a mean of 37.81, and the overall sum is 1,607.6,
giving a general mean of 40.19. Thus in both series the mean
is below 45, which is the value of $\phi$ corresponding to the expectation
of $p = 0.5$.

The values of $\phi$ may be subjected to an analysis of variance.
Taking the expected $\phi = 45$ as the mean from which deviations
are calculated, each value of $\phi$ will contribute 1 degree of freedom,
giving 40 in all. Of these, 1 will clearly be concerned with the
deviation of the general mean of the observations from 45,
1 will be concerned with the difference between the means of
the white and red series, while 38 will be concerned with the
variation of $\phi$ within the two series, each of which will contribute
19 degrees of freedom.

In obtaining the sum of squares of $\phi$ it is easier to use the
working mean of 0, and perform the analysis as though it comprised
only 39 degrees of freedom, the one concerned with the
general deviation from the expectation of 45 being omitted.
In this way we find

$$S(\phi^2) = 70{,}510.460 \quad\text{and}\quad \frac{1}{n}S^2(\phi) = \frac{1}{40}(1{,}607.6)^2 = 64{,}609.444$$

leaving 5,901.016 as $S(\phi - \bar{\phi})^2$ for 39 degrees of freedom. Of this
total $\tfrac{1}{20}(851.4^2 + 756.2^2) - 64{,}609.444$ or 226.576 is ascribable to
the degree of freedom concerned with the difference between the
white and red series means, leaving $5{,}901.016 - 226.576$ or
5,674.440 for the 38 degrees of freedom concerned with the
pooled variation within the two series. On finding the sum of
squares for each series separately, round its own mean, it appears
that this pool contains 1,777.482 from the white series and
3,896.958 from the red.

The sum of squares for the deviation of the general mean
from 45 is found as $\tfrac{1}{40}(1{,}800 - 1{,}607.6)^2$, 1,800 being the expected
total of $\phi$ and 1,607.6 that observed. This reduces to 925.444.
The full analysis of variance may then be set out as in Table 60.

TABLE 60

Analysis of Variance of Radish Data

                                 Sum of                              Mean
Item                             Squares       N     $\chi^2$      Square        t       Probability
General Deviation                925.444       1      33.826       925.444     2.491     0.02-0.01
Difference of W and R Means      226.576       1       8.281       226.576     1.232     0.3-0.2
Variation within Series        5,674.440      38     207.390       149.327
Total                          6,826.460      40

White series   Deviation         118.098       1       4.316       118.098     1.123     0.3-0.2
               Heterogeneity   1,777.482      19      64.967        93.552

Red series     Deviation       1,033.922       1      37.790     1,033.922     2.245     0.05-0.02
               Heterogeneity   3,896.958      19     142.433       205.103
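The single-degree-of-freedom items of Table 60 follow from the series totals alone. The sketch below takes the totals of $\phi$ (851.4 and 756.2) and the within-series mean square (149.327) as given in the text rather than recomputing them from Table 59; the variable names are illustrative.

```python
import math

white_total, red_total = 851.4, 756.2   # sums of phi quoted in the text
per_series = 20                         # mother plants in each series
expected_phi = 45.0                     # angle corresponding to p = 0.5
error_mean_square = 149.327             # within-series mean square, Table 60

n = 2 * per_series
grand_total = white_total + red_total
correction = grand_total ** 2 / n

# Difference between the white and red series means (1 degree of freedom)
between_series = (white_total ** 2 + red_total ** 2) / per_series - correction

# Deviation of the general mean from the expectation (1 degree of freedom)
general_deviation = (expected_phi * n - grand_total) ** 2 / n

t_general = math.sqrt(general_deviation / error_mean_square)
t_difference = math.sqrt(between_series / error_mean_square)
print(round(between_series, 3), round(general_deviation, 3),
      round(t_general, 3), round(t_difference, 3))
```

The values of t agree with those tabulated to within rounding in the second or third decimal place.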




Now $\phi$ is, as we have seen, subject to a theoretical variance
of $\dfrac{820.7}{n}$ or, in this case, with progenies of 30 seedlings, 27.36.
Therefore, if each sum of squares in the analysis is divided by
27.36, a $\chi^2$ will be obtained testing the significance of that item.
When this is done every item is found to have a significantly low
probability; and, in particular, that for the variation within
series has a very small probability of occurrence by chance.
We must, therefore, abandon the use of the theoretical variance
and use the mean square for the variation within series as an
estimated error. The mean squares may be obtained by dividing
either the sums of squares or the $\chi^2$ values by the numbers of
degrees of freedom. A constant factor difference of 27.36 is
involved between the two methods, and this vanishes in calculating
$t$ for the test of significance. In Table 60 the mean squares
have been found from the sum of squares.

The general deviation from the mean and the difference
between the white and red series means are then tested by finding
$t$ as the square root of the ratio of the appropriate mean square
to the error mean square. The general deviation gives a
$t_{[38]} = 2.491$, which may be entered in the table of $c$ since $N > 30$.
It has a probability of 0.02-0.01 and so affords good grounds
for deciding that the general mean is below the expected value
of 45. In the case of the difference between the white and red
series means $t_{[38]} = 1.232$ and this is not significant. The series
do not, therefore, differ in so far as the present evidence goes.
We may, therefore, take a combined mean for both series.
This has already been found as 40.19 or, reconverting to
proportions, $p = 0.416$.

The lower part of Table 60 shows the test of deviation from
the expected value of 45 as applied to each series separately.
It will be seen that the mean square error for the white series is
only 93.5517, while that of the red series is 205.1031. These
give a variance ratio of 2.192 for $N_1 = N_2 = 19$ and the table of
variance ratios shows the probability of obtaining such a difference
due to chance to be only about 0.05. There is, therefore, some
suspicion that the internal variations of the series differ; but
since there is no obvious reason for expecting such a result and
since the significance is only at the suggestive level, it might
reasonably be assumed that the apparent departure was fortuitous,
until further investigations, which are clearly desirable, provide
unambiguous evidence of a difference in variation.

If, however, the deviations of the white and red series means 
from the expectation of 45 are tested against their respective 
intra-series mean squares, the deviation of the white series mean 
is insignificant, and even that for the red series is barely significant. 
The advantage of the joint test described above is clear. The 
series means are shown by it not to differ from one another, and 
so are capable of being tested for their joint deviation from 
expectation. This is more significant than either single deviation 
when tested alone. 


It may be noted that the sum of squares for general deviation
and that for difference of white and red means in the upper
analysis of Table 60 together equal the pooled sums of squares
for the separate deviations of the series means in the lower
analyses. The advantage of the joint test arises, therefore, from
the more informative partition of the sum of squares in the upper
analysis as compared with the lower. This freedom to adopt
the more valuable partition depends on the use of the analysis
of variance technique, which, as we have seen, is made possible
by the angular transformation. Without this transformation
the available analytical methods would have been less informative
as well as more cumbersome.

One further value of the angular transformation may be
noted in passing. Since the values of $I_\phi$ and $V_\phi$ are independent
of $\phi$ itself, any required level of precision may be secured merely
by adjustment of the number of observations or individuals.
In planning an experiment we can thus decide at once the number 
of observations necessary for the attainment of any given level 
of precision in the results, or for the detection of deviations of 
a given magnitude at a given level of probability. This is 
impossible, of course, if the results are to be expressed and tested 
in terms of the untransformed proportion p, where the variance 
is dependent on the value of p itself. 

The benefits accruing from the angular transformation of 
proportions may be secured also with data which take the form 
of Poisson series. As we have seen in Section 12, the Poisson 
series is completely described by its mean, to which its variance 
is equal. The variance of the mean, $\bar{x}$, is therefore $\bar{x}/n$ where 
$n$ observations are used. Then if we put $y = \sqrt{\bar{x}}$, $\frac{d\bar{x}}{dy} = 2\sqrt{\bar{x}}$, and 

$$I_y = I_{\bar{x}}\left(\frac{d\bar{x}}{dy}\right)^2 = \frac{n}{\bar{x}}\,(2\sqrt{\bar{x}})^2 = 4n.$$

The square root of the mean of a Poisson 
series has, therefore, a variance independent of its own value. 
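The constancy of this variance can be checked numerically. The following is a minimal sketch in Python and forms no part of the original treatment; the function name `sqrt_poisson_variance` and the truncation of the series at 400 terms are assumptions made for illustration. It sums the Poisson series for a single count (n = 1), for which the argument above leads us to expect a variance close to 1/4 whatever the mean.

```python
import math

def sqrt_poisson_variance(mean, terms=400):
    """Variance of sqrt(X) for X ~ Poisson(mean), by summing the series."""
    prob = math.exp(-mean)          # P(X = 0)
    e_root = 0.0                    # accumulates E[sqrt(X)]
    for k in range(terms):
        if k > 0:
            prob *= mean / k        # P(X = k) obtained from P(X = k - 1)
        e_root += math.sqrt(k) * prob
    # E[(sqrt X)^2] = E[X] = mean, so the variance is mean - (E[sqrt X])^2
    return mean - e_root ** 2

for m in (10, 20, 50):
    print(m, round(sqrt_poisson_variance(m), 4))
```

The agreement with 1/4 improves as the mean increases; with very small means the approximation is poorer.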


60. THE PROBIT TRANSFORMATION 

The transformation of a variate may also be of value in 
giving a simpler relation with a second variate than does the 
untransformed form. Thus, as we have seen in Section 31, the 
compound interest law, relating growth to time, leads to a 
logarithmic transformation, since if growth follows this law the 
relation between log size (or weight) and time is linear and 
therefore easier to manage statistically than that between 
untransformed size and time. The probit transformation is 
widely used for this purpose in toxicology, biological assay and 
similar work where response is quantal, i.e. all or none. 

In order to see what this transformation is, let us consider 
the type of data which such experiments yield. A number, n, 
of animals are exposed to a poison and s of them survive, the rest 
dying. First of all this implies variation in the individual’s 
capacity to resist the poison. The variation may be genetic or 
it may depend on fortuitous circumstances in the administration 
of the poison, or it may arise from both causes ; but in any case, 
all those whose individual lethal dose is below that which they 
receive will die, and the rest will survive. If the frequencies of 
individuals requiring the various possible lethal doses are normal 
in the population from which the test animals are taken, an 
ordinate at the point on the abscissa representing the dose in 
question divides the curve into two parts, whose areas are 
determined by the distance of that point from the mean when 
expressed in terms of the standard deviation, i.e. expressed as 
a normal deviate. Furthermore, the area to the left of the 
ordinate gives the proportion of animals of lethal dose below the 
dose given, i.e. those who will die, and the area to the right of 
the ordinate the proportion of animals of lethal dose above that 
administered, i.e. those who will survive (Fig. 10). 

Now, if we plot the proportion of animals dying, i.e. the area 
of the curve on the left of the ordinate, against the dose we shall 
clearly obtain a sigmoid curve (or ogive as Galton called it) 
relating mortality to dose. But if we plot the normal deviate 
which corresponds to the proportion dying, in the sense that 
the ordinate at this normal deviate cuts off an area of the correct 
proportion under the normal curve, we must obtain a straight-line 
relation to dose ; for we are in fact replacing the proportion of 
deaths by a kind of hypothetical dose whose relation to the dose 
of poison known to have been administered we can now investigate. 
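The straightening of the sigmoid curve may be illustrated by a short sketch in Python. The tolerance distribution used here (mean 10, standard deviation 2, in arbitrary dose units) is purely hypothetical, and the function names are introduced only for illustration; `NormalDist` is taken from the standard `statistics` module.

```python
from statistics import NormalDist

# Hypothetical tolerance distribution: individual lethal doses normally
# distributed with mean 10 and standard deviation 2 (arbitrary units).
tolerance = NormalDist(mu=10.0, sigma=2.0)
unit = NormalDist()  # normal curve with mean 0 and standard deviation 1

def proportion_dying(dose):
    """Area of the tolerance curve to the left of the dose given."""
    return tolerance.cdf(dose)

def normal_deviate(proportion):
    """Deviate whose ordinate cuts off the given area under the unit curve."""
    return unit.inv_cdf(proportion)

doses = [6.0, 8.0, 10.0, 12.0, 14.0]
for d in doses:
    p = proportion_dying(d)
    # p follows the sigmoid curve; the deviate rises by equal steps with dose
    print(d, round(p, 4), round(normal_deviate(p), 4))
```

The proportions trace the ogive, while the deviates are simply the doses measured from the mean in units of the standard deviation, and so lie on a straight line.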

In practical toxicology and biological assay the frequency 
distribution of individual lethal doses is generally not normal 
when the actual dose of poison or drug is used as the abscissa. 
It is, however, normal, or very nearly so, when the logarithm of 
the dose of poison or drug is used. A double transformation is 
therefore used to obtain the desired linear relation, viz. that of 
the dose into log dose to give a normal distribution and that of 
proportion of deaths into normal deviates to remove the sigmoid 
shape of the integrated normal curve. This log transformation 
of dose is not, however, an essential part of the treatment. It 
is an addition, empirically known to be desirable in most cases, 
though not in all. With any new poison or drug, or any new 
method of preparation or measurement, the use of the log transformation 
requires rejustification. If it should prove unsuitable, 
and the actual scale used in measuring the dose administered is 
also unsuitable, other more effective transformations should 
be sought. 

The normal deviates corresponding to the various proportions 
of deaths can be read from any table of normal integrals, such 
as the table of c used in finding the probabilities corresponding 
to deviations of a quantity from its expected value (see 
Section 14 and Fig. 1), for the area of the curve cut off by any 
ordinate is the integral of the curve up to that ordinate. It 
is, however, then necessary to use - and + signs to denote proportions 
below and above 0.5 deaths. (This proportion corresponds 
to the centre, i.e. mean of the normal distribution, which 
for the purpose of tabulation is taken at 0.) Bliss has pointed 
out that the inconvenience of the - sign can generally be removed 
by adding 5 to the values of the normal deviates. The value 
so obtained is called a probit, and a probit of less than 5 corresponds 
to a proportion of less than 0.5, i.e. to a negative normal 
deviate, while a probit of greater than 5 indicates a proportion 
over 0.5, i.e. a positive normal deviate. The probit value 5, of 
course, corresponds to a proportion of 0.5 and a normal deviate 
of 0. Negative probits are obtained only rarely, with extremely 
low proportions of death. 

FIG. 10 
Above. Diagram to show the relation of the probit value (y) to the 
proportion of deaths (P) and survivals (Q). Probits are normal deviates 
to which 5 has been added, and the ordinate (Z) at y is used in calculating 
working probits and weighting coefficients. 
Below. The visually fitted provisional line (broken) and the regression 
calculated from it (solid) for Smith’s data (Example 35), plotted against 
dose in log units. The arrow marks the point to which the upper part 
of the figure refers. 

Tables giving the probits corresponding to various proportions 
have been prepared by Bliss and are also to be found in Fisher 
and Yates’s collection. If the proportions of deaths among the 
test animals at various doses of a poison are converted into 
probits by entering in these tables, and these probits plotted 
against the dose (or more often log dose) of the poison, the points 
are expected to fall on a straight line within the limits of sampling 
error. A simple regression analysis should then serve to elucidate 
the relation statistically. Knowing this line, we can find the 
dose corresponding to any kill, or vice versa. 
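Where tables are not to hand, the conversion can be sketched in Python as below. This is an illustration only: the function names are introduced here for convenience, and the intercept and slope passed to `dose_for_kill` are hypothetical values standing for a line already fitted.

```python
from statistics import NormalDist

unit = NormalDist()  # normal curve with mean 0 and standard deviation 1

def probit(proportion):
    """Normal deviate corresponding to the proportion, with 5 added."""
    # inv_cdf rejects proportions of exactly 0 or 1, for which no probit exists
    return 5.0 + unit.inv_cdf(proportion)

def proportion_from_probit(y):
    """Proportion of deaths corresponding to a probit y."""
    return unit.cdf(y - 5.0)

def dose_for_kill(proportion, intercept, slope):
    """Invert a fitted line, probit = intercept + slope * dose."""
    return (probit(proportion) - intercept) / slope

for p in (0.1, 0.5, 0.9, 0.975):
    print(p, round(probit(p), 4))
```

A kill of 10 per cent. and one of 90 per cent. give probits equally distant from 5 on opposite sides, so that the two sum to 10.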

Since the variance of the proportion of deaths is dependent 
on the proportion itself, we may expect the precision with which 
any probit is determined to vary with its value. The probit 
transformation does not share with the angular transformation 
the property of removing this dependence. Each observed point 
must, therefore, be weighted in calculating the regression line of 
probit on dose, or log dose. Furthermore, if a dose produces 
either no kill or complete extermination of the test subjects, as 
must happen fairly often in small experiments, P, the proportion 
of deaths, or Q (= 1 - P), the proportion of survivors, has a variance 
of 0, the amount of information becoming ∞. We may then 
anticipate weighting difficulties in the regression analysis. The 
proper weights to be used and the way of dealing with cases of 
zero or total survival can be derived by use of the method of 
maximum likelihood, as Fisher has shown in an appendix to one 
of Bliss’s papers. 

At any given dose of poison let there be $s$ survivors out of 
$n$ test subjects. Then $P$, the probability of death, is estimated 
as $\frac{n-s}{n}$ and $Q = 1 - P$ as $\frac{s}{n}$. Further, let these proportions of death 
and survival correspond to a probit of $y$. With a probability 
$P$ of death and $Q$ of survival, we expect $s$ survivors out of $n$ in 
$\frac{n!}{(n-s)!\,s!}P^{n-s}Q^{s}$ of cases. Each experiment with its characteristic 
dose will give a likelihood expression of this kind, and fitting 
the regression line of $y$ on dose or log dose by the method of 
maximum likelihood involves equating to zero the sum of the 
differentials with respect to $y$ of the logarithms of these likelihood 
expressions (Section 53). The log of each likelihood expression 
is of the form $C + (n-s)\log P + s\log Q$, $C$ being the constant which 
vanishes on differentiating, leaving us with a differential coefficient 
with respect to $P$ of 

$$\frac{n-s}{P} - \frac{s}{Q} \quad\text{or}\quad \frac{Qn-s}{PQ}.$$

But we need the differential 
coefficient with respect to the probit value, and this is found 
by multiplying by $\frac{dP}{dy}$. Now $P$, the probability of death, is the 
integral of the normal curve from probit value $-\infty$ to $y$. The 
value of $\frac{dP}{dy}$ must therefore be the value of the ordinate to the 
normal curve at $y$, which we may, following Fisher, call $Z$ (see 
Fig. 10). (This $Z$ is, of course, not to be confused with the 
$z$ used in tests of significance.) The contribution to the maximum 
likelihood expression is then 

$$(Qn-s)\frac{Z}{PQ}.$$

The first part of this expression is the difference between the 
expected survival, $Qn$, and that observed, $s$. If $n$ and $s$ are 
large enough for the distribution of $s$, which will be given by the 
binomial expansion, to be treated as normal, this factor $Qn-s$ 
must be proportional to the difference between the probit expected 
and that corresponding to the observed kill. But a difference 
expressed in terms of proportions is again $\frac{dP}{dy}$, or $Z$, times 
the corresponding difference in probits. In 
other words, we may replace $(Qn-s)$ by $n(y-Y)Z$, 
where $Y$ is the probit expected. The differential coefficient thus 
becomes $(y-Y)\frac{nZ^2}{PQ}$. The solution to the maximum likelihood equation 
(which consists of a series of such differential coefficients, one 
from each test of the different doses) thus consists of finding, by 
minimizing the sum of squares of $(y-Y)$, the values of $Y$ which 
have a linear relation to the dose of poison, $x$, as expressed in 
appropriate units. This is, of course, the method of least squares 
already used in regression calculations (Section 31), but with 
each point given its individual weight of $\frac{nZ^2}{PQ}$ in the calculation. 
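The differential coefficient obtained above may be checked against a direct numerical differentiation of the log likelihood. The sketch below is in Python and is offered only as a check on the algebra; the function names and the trial values (10 animals, 3 survivors, probit 5.4) are hypothetical.

```python
from math import log
from statistics import NormalDist

unit = NormalDist()  # normal curve with mean 0 and standard deviation 1

def log_likelihood(y, n, s):
    """(n - s) log P + s log Q at probit y, the constant C being omitted."""
    P = unit.cdf(y - 5.0)      # probability of death
    Q = 1.0 - P                # probability of survival
    return (n - s) * log(P) + s * log(Q)

def differential_coefficient(y, n, s):
    """(Qn - s) Z / (PQ), the form reached in the text."""
    P = unit.cdf(y - 5.0)
    Q = 1.0 - P
    Z = unit.pdf(y - 5.0)      # ordinate of the normal curve at y
    return (Q * n - s) * Z / (P * Q)

n, s, y, h = 10, 3, 5.4, 1e-5
numeric = (log_likelihood(y + h, n, s) - log_likelihood(y - h, n, s)) / (2 * h)
print(differential_coefficient(y, n, s), numeric)
```

The two printed values should agree to within the error of the numerical differentiation.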

The values of $P$ and $Q$ used in calculating the weight must, of 
course, be those expected from the fitted line, which is not itself known 
until the calculation has been made. Moreover, where there are no survivors 
the empirical probit becomes infinite, even though $Qn-s$ must still be 
finite. The likelihood equation may nevertheless be solved by successive 
approximation. A provisional line is first found, and the expected probits, 
$Y$, read from it are used in calculating the weights and the working 
probits, $y_w$. The regression calculated by means of these provides new 
expectations for the calculation of a still more exact line, if desired. 
Even this second calculation is unnecessary if the provisional 
first line has been drawn with reasonable accuracy. 

To return to the working probit, $y_w$, this is found from the 
equation established above, $Qn-s = nZ(y-Y)$, when recast in the 
form 

$$y_w = Y + \frac{Q}{Z} - \frac{s}{nZ}.$$

Where there are no survivors $s = 0$ and 

$$y_w = Y + \frac{Q}{Z}.$$

Tables for this purpose have been given by Bliss and also by Fisher and 
Yates. These give the value of $Y + \frac{Q}{Z}$ for each value of $Y$. They 
also give $\frac{1}{Z}$ for each value of $Y$, so that $y_w$ can be found when 
$s \neq 0$, by subtracting from the tabulated value of $Y + \frac{Q}{Z}$, 
corresponding to the probit $Y$ expected from the provisional regression 
line, $\frac{1}{Z}$ multiplied by the observed fraction $\frac{s}{n}$ which survive, 
and which we may denote by $q$. 
When less than half the test animals are expected to die, 
i.e. $Y < 5$, it may be more convenient to use the complementary 
expression 

$$y_w = Y - \frac{P}{Z} + \frac{n-s}{nZ}$$

and Bliss’s table gives the material for this too. Both Bliss and 
Fisher and Yates give for the various values of $Y$ the corresponding 
weighting coefficients $\frac{Z^2}{PQ}$, which must, of course, be multiplied 
by $n$, the number of individuals in the test. It will be seen that 
the weight is at a maximum when $Y = 5$ (see p. 227) and falls off 
symmetrically on each side of this value. Thus the weight 
when $Y = 5 + a$ is the same as that when $Y = 5 - a$. This symmetry 
is seen too in another connection, for the probit for a kill of 
$s$ out of $n$ animals is the complement of the probit for a survival 
of $s$ out of $n$ animals in that they depart equally from the value 5 
but in opposite directions. The one can be found by subtracting 
the other from 10. This property is used in finding $y_w$ when 
$Y < 5$ from Fisher and Yates’s tables, which cover the range 
$Y = 5.0$ to $Y = 8.9$. 


The use of the transformation tables in arriving at the probits 
and of the weights in calculating the regression lines will be 
better seen from an arithmetic example. 

Example 35. Table 61 gives the numbers of frogs (Rana 
pipiens) tested with various doses of anhydrous ouabain and the 
numbers dying in the various test classes as observed by Smith. 
The dose was adjusted to the weight of each individual frog and 
so is expressed as milligrams of ouabain per gram live weight of 
frog. Gaddum has shown that the log transformation is required 
if these data are to give a linear relation of probit to dose, and 
so the dose is expressed in log milligrams per gram. Since in 
every case the logarithm lies between -3 and -4, 4 has been 
added to each log dose to ease calculation by removing the 
minus signs. These doses may therefore be regarded as log 
milligram per 10 Kg. of frog. 

The proportions killed (p) were used to find the empirical 
probits (y) from Fisher and Yates’s Table IX. No probit can, 
of course, be given for the highest dose when all frogs died. These 
empirical probits are plotted against the dose, in log units, in 
Fig. 10, and the provisional regression line (shown broken in the 
figure) drawn in by eye. If this line is drawn so as to give 
a reasonably good fit, it may be unnecessary to use the first 
calculated regression line as the provisional for a second calculation. 
It is therefore profitable to devote some consideration to 
drawing this first visual provisional line. In Fig. 10 it passes 
through the points representing probit 4.5 at x = 0.3 and probit 6.9 
at x = 0.6. Its slope is therefore 8.0 probits for a change of 1 in x. 
Then starting from the point Y = 4.5, x = 0.3 (these are now expected 
probits, Y, as they are obtained from the provisional line), we 
find Y = 4.5 + (0.0010 × 8.0) = 4.51 for our lowest dose at x = 0.3010, 
and so on, as shown in Table 61. 

TABLE 61 
[Numbers of frogs tested (n) and surviving at each dose, with the 
probits (y) and the expected probits (Y) from the visual provisional line.] 
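The reading of expected probits from the provisional line amounts to the following small calculation, sketched here in Python. The function name `provisional_line` is introduced only for illustration; the two points through which the line passes, and the doses 0.3010 and 0.3522, are those quoted in the text.

```python
def provisional_line(x1, y1, x2, y2):
    """Return the slope of the line through two points and a function
    giving the expected probit Y at any dose x."""
    slope = (y2 - y1) / (x2 - x1)

    def expected(x):
        return y1 + slope * (x - x1)

    return slope, expected

# The visual line passes through probit 4.5 at x = 0.3 and 6.9 at x = 0.6.
slope, expected_probit = provisional_line(0.3, 4.5, 0.6, 6.9)

for x in (0.3010, 0.3522):
    print(x, round(expected_probit(x), 2))
```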

The next step is to find the working probits, $y_w$, from $Y$, the 
expected probit, and $q = \frac{s}{n}$, the proportion of survivors observed 
in each experiment. The easiest case is that of the test at the 
highest dose where no frogs survived. Here $Y = 6.92$ and $q = 0$. 
Then $y_w = Y + \frac{Q}{Z}$, the so-called maximum working probit. From 
Fisher and Yates’s Table XI we find $y_w = 7.3376$ when $Y = 6.9$ and 
$y_w = 7.4214$ when $Y = 7.0$. Linear interpolation then gives 
$y_w = 7.3544$ for $Y = 6.92$. At the next dose, $x = 0.5441$, we have 
$Y = 6.45$ and $q = 0.10$. Then $y_w = Y + \frac{Q}{Z} - \frac{0.10}{Z}$. For $Y = 6.4$ the table 
gives $Y + \frac{Q}{Z} = 6.9394$ and $\frac{1}{Z} = 6.6788$, so that $y_w = 6.9394 - 0.6679 = 6.2715$. 
Similarly at $Y = 6.5$, $y_w = 6.2437$ and by interpolation, $y_w = 6.2576$. 

The remaining working probits are obtained in the same way, 
the only comment being required by those cases in which $Y < 5$, as, 
for example, at the lowest dose where $Y = 4.51$. The working 
probit in such a case can be found directly from Bliss’s table 
using the formula $y_w = Y - \frac{P}{Z} + \frac{p}{Z}$, where $p = \frac{n-s}{n}$ is the 
observed proportion killed. Fisher and Yates’s table does not include values 
of $Y$ lower than 5.0, so that the direct [...] 
We have, however, already seen that the case where $s$ die 
out of $n$ is the complement of that where $s$ survive out of $n$, in that 
the two probits deviate equally, but in opposite directions, 
from 5. To put it another way, the sum of the two probits is 10. 
If, then, we take the complementary probit $Y' = 10 - Y$ [...] 
of $q$. [...] The working probit for 
dose 0.3522 is similarly calculated as 5.0002. 
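The tabular entries used above can be reproduced directly from the normal curve. The following Python sketch is an illustration only, with `working_probit` a name introduced here; it computes $Y + \frac{Q}{Z} - \frac{q}{Z}$ without interpolation, so that for values of $Y$ lying between tabular entries its results may differ slightly in the later decimal places from those interpolated in the text.

```python
from statistics import NormalDist

unit = NormalDist()  # normal curve with mean 0 and standard deviation 1

def working_probit(Y, q):
    """Working probit Y + Q/Z - q/Z for expected probit Y and observed
    fraction q of survivors."""
    deviate = Y - 5.0
    Z = unit.pdf(deviate)           # ordinate of the normal curve at Y
    Q = 1.0 - unit.cdf(deviate)     # expected proportion of survivors
    return Y + Q / Z - q / Z

print(round(working_probit(6.9, 0.0), 4))   # maximum working probit at Y = 6.9
print(round(working_probit(6.4, 0.10), 4))  # Y = 6.4 with one-tenth surviving
```

Since the calculation is not confined to $Y \geq 5$, the complementary device needed with the printed tables is unnecessary here, though the relation on which it rests still holds: the working probit for $Y$ and $q$ and that for $10 - Y$ and $1 - q$ sum to 10.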



The weighting coefficients, $\frac{nZ^2}{PQ}$, are also found from Fisher 
and Yates’s Table XI, using the expected probits $Y$. Thus 
when $Y = 6.9$, the table gives $\frac{Z^2}{PQ} = 0.15436$ and when $Y = 7.0$, 
$\frac{Z^2}{PQ} = 0.13112$, so that by interpolation when $Y = 6.92$, $\frac{Z^2}{PQ} = 0.1497$. 
Since at this highest dose $n = 10$, $w = 1.497$. When $Y < 5$ use is 
made of the fact that the weight for the expected probit equals 
that for the complementary probit $Y'$. Thus for $Y = 4.51$ we use 
the weight given by $Y' = 10 - 4.51 = 5.49$. 
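The weighting coefficients may likewise be computed rather than read from the table. The Python sketch below is illustrative, the function names being introduced here; it evaluates $\frac{Z^2}{PQ}$ at the expected probit and multiplies by $n$.

```python
from statistics import NormalDist

unit = NormalDist()  # normal curve with mean 0 and standard deviation 1

def weighting_coefficient(Y):
    """Z^2 / (PQ) at the expected probit Y."""
    deviate = Y - 5.0
    Z = unit.pdf(deviate)
    P = unit.cdf(deviate)
    Q = 1.0 - P
    return Z * Z / (P * Q)

def weight(Y, n):
    """Weight n Z^2 / (PQ) for a test of n individuals."""
    return n * weighting_coefficient(Y)

for Y in (4.51, 5.0, 5.49, 6.9, 7.0):
    print(Y, round(weighting_coefficient(Y), 5))
```

The values for 4.51 and 5.49 coincide, and that at 5.0 is the greatest, in accordance with the symmetry noted above.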

The rest of the analysis follows the lines of simple regression 
calculations as discussed in Sections 31 and 32, with the slight 
complication that each point ($y_w$, $x$) is given its individual weight. 
We have already seen, in Section 41, how weights are used in 
calculating weighted means, and the formulae for calculating 
weighted regression lines can easily be derived in the same way 
from those used in unweighted regression analysis. 

Giving a weight $w$ to a point may be regarded as equivalent 
to saying that that point has been observed $w$ times, each 
hypothetical observation having weight 1. Then the total 
number of such hypothetical observations is given by $S(w)$, and 
the sum of $y_w$ for the hypothetical assay is similarly $S(wy_w)$. 
Hence $\bar{y}_w = \frac{S(wy_w)}{S(w)}$. Similarly $\bar{x} = \frac{S(wx)}{S(w)}$. 

Turning to the sum of squares and cross-products, the sum 
of squares of deviations of $x$ from zero is clearly $S(wx^2)$ and the 
correction term must be $\frac{S^2(wx)}{S(w)}$. The sum of squares of deviations 
of $x$ from $\bar{x}$ is thus 

$$S[w(x-\bar{x})^2] = S(wx^2) - \frac{S^2(wx)}{S(w)}.$$

Similarly 

$$S[w(y_w-\bar{y}_w)^2] = S(wy_w^2) - \frac{S^2(wy_w)}{S(w)}$$

and 

$$S[wy_w(x-\bar{x})] = S(wxy_w) - \frac{S(wx)\,S(wy_w)}{S(w)}.$$

Applying these formulae to the data of Table 61 

$S(w) = 49.441$, $S(wx) = 19.3485$, $S(wy_w) = 257.6056$, 
$S(wx^2) = 7.881733$, $S(wy_w^2) = 1364.566$, $S(wxy_w) = 103.376004$, 

so $\bar{x} = 0.3913$ and $\bar{y}_w = 5.2104$, 

$$S[w(x-\bar{x})^2] = 7.881733 - \frac{19.3485^2}{49.441} = 0.309797,$$

$S[w(y_w-\bar{y}_w)^2] = 22.347836$ and $S[wy_w(x-\bar{x})] = 2.563345$. 

Then 

$$b = \frac{S[wy_w(x-\bar{x})]}{S[w(x-\bar{x})^2]} = \frac{2.563345}{0.309797} = 8.2743.$$

The calculated regression line is thus 

$$Y = 5.2104 + 8.2743(x - 0.3913),$$

$Y$ now being the probit expected from this new line and so 
replacing the expectations from the visual provisional line. 
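The weighted sums and the regression constants derived from them may be set out as a short routine. The following Python sketch is illustrative only: the function name `weighted_regression` and the dictionary keys are introduced here, and the weights, doses and working probits in the usage lines are hypothetical figures, not the data of Table 61.

```python
def weighted_regression(w, x, y):
    """Weighted regression of working probits y on doses x with weights w."""
    Sw = sum(w)
    Swx = sum(wi * xi for wi, xi in zip(w, x))
    Swy = sum(wi * yi for wi, yi in zip(w, y))
    Swxx = sum(wi * xi * xi for wi, xi in zip(w, x))
    Swyy = sum(wi * yi * yi for wi, yi in zip(w, y))
    Swxy = sum(wi * xi * yi for wi, xi, yi in zip(w, x, y))
    # Sums of squares and cross-products about the weighted means
    Sxx = Swxx - Swx ** 2 / Sw
    Syy = Swyy - Swy ** 2 / Sw
    Sxy = Swxy - Swx * Swy / Sw
    regression = Sxy ** 2 / Sxx          # reduction due to fitting the line
    return {
        "xbar": Swx / Sw,
        "ybar": Swy / Sw,
        "b": Sxy / Sxx,
        "Sw": Sw,
        "Sxx": Sxx,
        "regression": regression,
        "remainder": Syy - regression,
    }

# Hypothetical figures for illustration only.
w = [1.5, 4.0, 6.0, 5.0, 2.0]
x = [0.30, 0.35, 0.40, 0.45, 0.50]
y = [4.4, 4.9, 5.3, 5.6, 6.1]
fit = weighted_regression(w, x, y)
print(fit["xbar"], fit["ybar"], fit["b"])
```

The fitted line is then Y = ybar + b(x - xbar), in the form used in the text.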

We must next enquire into the adequacy of this line as 
a representation of the data, and also find the standard errors 
of the regression constants. The amount by which the sum of 
squares of $y_w$ is reduced in fitting the regression line will, by 
analogy with unweighted regression, be $\frac{S^2[wy_w(x-\bar{x})]}{S[w(x-\bar{x})^2]}$, which in 
the present case becomes $\frac{2.563345^2}{0.309797}$ or 21.209817. This must 
correspond to 1 degree of freedom, leaving 4, out of the 5 between 
6 observations, for the remainder, or error, sum of squares. 

Before the full significance of this analysis of the sum of 
squares of $y_w$ can be appreciated, it must be observed that the 
weighting coefficients are themselves the theoretically fixed 
amounts of information concerning the probits to which they are 
attached. Now the variance of $p$ in the binomial expansion of 
$(p+q)^n$ was seen in Section 13 to be $\frac{pq}{n}$, where $p + q = 1$. The 
amount of information about $P$, the proportion of deaths, is 
therefore $\frac{n}{PQ}$, and the amount of information about the corresponding 
probit $Y$ will, as seen on p. 218, be $I_Y = I_P\left(\frac{dP}{dY}\right)^2$. But 
$\frac{dP}{dY} = Z$ and so $I_Y = \frac{nZ^2}{PQ} = w$. 

Weights of this kind are therefore equivalent to dividing 
each observation in the calculation of the mean, and each squared 
and cross-multiplied deviation in the calculation of sums of 
squares and cross-products, by the corresponding theoretically 
fixed variance. The sums of squares for regression and remainder 
are thus $\chi^2$’s because each of their constituent parts is the ratio 
of an observed squared deviation to a variance fixed by hypothesis. 
Our analysis of the sum of squares of $y_w$ thus becomes 

Item                    χ²           N     Probability 
Regression              21.209817    1     very small 
Remainder (Error)        1.138019    4     0.9-0.8 
Total                   22.347836    5 

The regression line clearly accounts for all but an insignificant 
portion of the variation in probit value, and so affords an adequate 
description of the probit dose relation. 
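The probabilities entered in the analysis may be verified without tables in the two cases that arise here. The sketch below, in Python, is illustrative; the function names are introduced here, and only 1 degree of freedom and even numbers of degrees of freedom are handled, since these admit simple closed expressions.

```python
import math

def chi2_tail_1df(chi2):
    """Probability of exceeding chi2 with 1 degree of freedom; chi2 is then
    the square of a normal deviate."""
    return math.erfc(math.sqrt(chi2 / 2.0))

def chi2_tail_even_df(chi2, df):
    """Probability of exceeding chi2 with an even number of degrees of
    freedom: exp(-x/2) * sum of (x/2)^k / k! for k below df/2."""
    if df <= 0 or df % 2:
        raise ValueError("df must be a positive even number")
    half = chi2 / 2.0
    term, total = 1.0, 1.0
    for k in range(1, df // 2):
        term *= half / k
        total += term
    return math.exp(-half) * total

print(chi2_tail_1df(21.209817))          # regression, 1 degree of freedom
print(chi2_tail_even_df(1.138019, 4))    # remainder, 4 degrees of freedom
```

The first probability is very small and the second lies between 0.9 and 0.8, as stated in the analysis.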

Since the error $\chi^2$ is insignificant, we take the residual variance, 
$V_y$, round the regression line as 1, the average contribution each 
degree of freedom is expected to make to a $\chi^2$. (With a significant 
error $\chi^2$ it would have been necessary, of course, to treat $\chi^2$ as 
a sum of squares, divide it by $N$ and take the resulting mean 
square as $V_y$, as on p. 192.) Then, once again by analogy with 
unweighted regression, 

$$V_{\bar{y}} = \frac{V_y}{S(w)} = \frac{1}{49.441} = 0.020226 \quad\text{and}\quad s_{\bar{y}} = \sqrt{V_{\bar{y}}} = 0.1422,$$

$$V_b = \frac{V_y}{S[w(x-\bar{x})^2]} = \frac{1}{0.309797} = 3.2279 \quad\text{and}\quad s_b = \sqrt{V_b} = 1.7966.$$

The regression coefficient, $b$, represents the rate of change of 
mortality, measured in probits, on the dose, expressed in this 
case in log units. It is, however, helpful to look at it in another 
way. The reciprocal of $b$, in this case $\frac{1}{8.2743}$ or 0.1209, is the 
increase in dose, expressed in logarithmic units, required 
to raise the probit by 1, and so is an estimate of the standard deviation 
of the distribution of individual lethal doses of the 
test animals. The greater the variation in individual lethal 
dose of the test animals, the smaller is the slope of the regression 
line and the less precisely will the dose, necessary to give 
a particular percentage kill, be estimated. 

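The dosage giving 50% death is widely used in such work, and 
is generally denoted by LD50. Since it is the dose corresponding 
to $Y = 5.0$, it is found as 

$$LD50 = \bar{x} + \frac{5 - \bar{y}_w}{b}.$$

This is a special case of the general relation considered on p. 118, 

$$Y_x = a + b(x - \bar{x}),$$

in which we set $Y_x = 5.0$ and 
so find $x = LD50$. Now as we have seen $V_Y = V_{\bar{y}} + V_b(x-\bar{x})^2$, which 
in weighted regression, when $V_y$ is taken to be 1, becomes 

$$V_Y = \frac{1}{S(w)} + \frac{(x-\bar{x})^2}{S[w(x-\bar{x})^2]}.$$

But $V_x = V_Y\left(\frac{dx}{dY}\right)^2$, and $\frac{dY}{dx}$ 
is the rate of change of $Y$ on $x$, i.e. is here the slope of the 
regression line, $b$. 
Hence 

$$V_x = \frac{1}{b^2}\left[\frac{1}{S(w)} + \frac{(x-\bar{x})^2}{S[w(x-\bar{x})^2]}\right]$$

and, in particular, the variance of LD50 is 

$$V_{LD50} = \frac{1}{b^2}\left[\frac{1}{S(w)} + \frac{(LD50-\bar{x})^2}{S[w(x-\bar{x})^2]}\right].$$

In the present case $\bar{x} = 0.3913$, $b = 8.2743$, 
$S(w) = 49.441$ and $S[w(x-\bar{x})^2] = 0.309797$. Hence 

$$LD50 = 0.3913 + \frac{5 - 5.2104}{8.2743} = 0.3659 \text{ in log units}$$

(or 0.0002322 mgm. ouabain per gm. frog weight) and 

$$V_{LD50} = \frac{1}{8.2743^2}\left[\frac{1}{49.441} + \frac{0.02542^2}{0.3098}\right] = 0.0003259.$$

Then $s_{LD50} = \sqrt{V_{LD50}} = 0.01805$ log units (or 0.00001145 mgm.). 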
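The arithmetic of the LD50 and its standard error can be retraced as follows. This Python sketch is illustrative, with function names introduced here; the constants are those quoted in the text, and the conversion to milligrams per gram assumes, as stated in Example 35, that 4 was added to each log dose.

```python
import math

def dose_for_probit(Y, xbar, ybar, b):
    """Dose at which the fitted line Y = ybar + b(x - xbar) gives probit Y."""
    return xbar + (Y - ybar) / b

def dose_variance(x, xbar, b, Sw, Sxx):
    """Variance of an estimated dose x, taking the residual variance as 1."""
    return (1.0 / b ** 2) * (1.0 / Sw + (x - xbar) ** 2 / Sxx)

# Constants of the first fitted line, as quoted in the text.
xbar, ybar, b = 0.3913, 5.2104, 8.2743
Sw, Sxx = 49.441, 0.309797

ld50 = dose_for_probit(5.0, xbar, ybar, b)
variance = dose_variance(ld50, xbar, b, Sw, Sxx)
standard_error = math.sqrt(variance)
ld50_mg_per_g = 10 ** (ld50 - 4)   # undo the 4 added to each log dose

print(round(ld50, 4), round(variance, 7), round(standard_error, 5))
print(ld50_mg_per_g)
```

The same two functions serve for any other percentage kill, the appropriate probit being put in place of 5.0.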

This first calculated regression line may itself be used as the 
provisional line in the calculation of a second, and presumably 
still better fitting, regression. Indeed, if the visual provisional 
line was a poor fit, the first calculated regression line cannot be 
expected to give a fully satisfactory representation of the data, 
and it must be used as the provisional line for a second calculation. 
In the present case, however, the use of our calculated relation 

$$Y = 5.2104 + 8.2743(x - 0.3913)$$

as the provisional line gives, after the second fitting, 

$$Y = 5.2087 + 8.2824(x - 0.3912).$$

The remainder or error $\chi^2$ is now 1.1701, a value slightly higher 
than that obtained after the first fitting. The second calculation 
shows no improvement on the first and none of the statistics 
obtained from the second fitting differed by so much as their 
standard errors from those given by the first calculation. In 
particular, the second estimate of LD50 is 0.3660 ± 0.01810 as 
compared with the first estimate of 0.3659 ± 0.01805. 

The first regression must therefore be judged to have been 
adequate. The visual provisional line used as the basis of the 
first fitting was satisfactory. By removing the need for a second 
fitting it well repaid the care expended in drawing it. 

Once a regression line has been found to represent the relation 
between dose and effect, it may be compared with other such 
lines, differing from it in slope or position or both, for such 
purposes as comparing the potencies of two samples of a drug, 
or standardizing a sample of unknown potency against one 
whose potency is known. A dosage-effect regression line may 
also be used to estimate the dose of poison required to achieve 
a specified percentage kill in insecticidal work, in a way similar 
to that used for finding the LD50. These uses of the line have 
been fully described by Gaddum, Bliss, and Irwin, to whose 
works reference should be made for the methods to be employed 
in the various circumstances. 


REFERENCES 

BLISS, C. I. 1935a. The calculation of the dosage-mortality curve. 
Ann. Appl. Biol., 22, 134-67. 

[...] dosage-mortality data. Ann. Appl. [...] 

- 1938. The determination of the dosage-mortality curve from small 
numbers. Quart. J. Pharm., 11, 192-216. 

GADDUM, J. H. 1933. Reports on biological standards. III. Methods 
of biological assay depending on quantal response. Med. Res. 
Council Spec. Rep. No. 183. H.M. Stat. Office. 

FISHER, R. A. 1935. The case of zero survivors. Appendix to Bliss [...] 

- and YATES, F. 1943. Statistical Tables for Biological, Agricultural 
and Medical Research. Oliver and Boyd. Edinburgh. 2nd ed. 

IRWIN, J. O. 1937. Statistical method applied to biological assays. 
Suppl. J. Roy. Statistical Soc., 4, 1-60. 

SMITH, M. I., quoted by GADDUM, J. H. 1932. The biological assay of 
[...] comparison with Ouabain. Quart. J. [...] 



GLOSSARY OF TERMS 

Bias. A consistent and false departure of an observed quantity from its 
proper value. The average error of an estimate. 

Binomial series. The series obtained by expanding to any power the 
sum of two quantities. 

c. The normal deviate. The ratio of an observed deviation to the appropriate 
standard deviation fixed by hypothesis. 

Confounding. The deliberate sacrifice of a potentially interesting comparison 
by identifying it with one which it is intended to eliminate 
in order to reduce the error variation. Such a comparison is said to 
[...] 

Partial -. The confounding of a comparison in part of a design, 
or in such a way that, after the elimination of ill-controlled comparisons, 
it is estimated with accuracy less than if such comparisons 
[...] 

Contingency table. A table of frequencies distinguishing two classifications 
[...] 

Correlation. The interdependence of two variates. The [...] 
independence. Applied also to the analysis of such interdependence 
by methods involving the two variates symmetrically. 

- coefficient. The ratio of the covariance of two variates to the 
geometrical mean of their variances. 

Inter-class -. The correlation of variates which are distinguished 
[...] 

Intra-class -. The correlation of variates which are not distinguished 
[...] 

[...] when due allowance is 
made for the effects of other uncontrolled variates. 

[...] to the corresponding number of degrees of freedom. 

Analysis of -. The simultaneous analysis of the sums of 
squares and cross-products of two or more variates. 

Cross-products. Sum of -. Sum of the products of corresponding 
deviations of two variates from their individual means. 

χ². The ratio of an observed sum of squares to the corresponding variance 
[...] 

Simple -. One in which the observed sum of squares is based 
on a single comparison. The square of a normal deviate. 

Compound -. One in which the observed sum of squares is 
based on several independent comparisons, and which can hence be 
resolved into two or more simple χ²’s. 

Degree of freedom. A comparison between the data, independent of the 
other comparisons used in the analysis. 

Number of -. The number of independent comparisons that 
can be made in the data. 

Deviation. Departure of any observation or quantity from its expected 
[...] 

Standard -. The distance, measured along the abscissa, of the 
point of inflection, or maximum slope, from the mean in a normal 
curve. Generalized as the square root of the variance. Called the 
Standard Error when the deviation can properly be regarded as an 
error, as in the case of estimates of parameters. 

Discriminant function. Linear compound of a series of variates with 
coefficients chosen to maximize the difference between the classes of 
object, of which the variates are measurements, relative to the 
variation within classes. 

Distribution. Frequency -. The distribution obtained when the fre¬ 

quencies with which observations fall into certain classes are plotted 
against those classes. 

Normal - (or Normal Curve of Errors). The limit to the 

binomial and multinomial series when the power is large and none 

of the summed quantities very small. The frequency distribution 

expected from a series of observations on a variate whose magnitude 

is influenced by a large number of agents having small independent 
effects. 

Effect. Main -. The direct effect of any treatment or agent, without 

reference to the effects of other treatments or agents. 

Interaction -. The mutual effects of two or more treatments 
on one another’s action. 

Efficiency of a method of estimation. The amount of information extracted 
from the data by the method, as a fraction of that extracted by 
maximizing the likelihood. 

Error. Control of -. The reduction of error variation by the use of 

restraints in experimental design. 

Sampling -. The variation in value of a statistic arising from 
the use of finite samples. 

Standard -. See Standard Deviation. 

- variance. The variance arising from agents uncontrolled in 
the experiment, with which the apparent effect of any controlled 
agent must be compared. 

Estimate. Consistent -. A statistic which tends to approach the 

parameter in value as the sample size increases. 

Efficient -. A statistic which tends, as the sample size increases, 
to use all of the information available in the data. 

Sufficient -. A statistic which uses all the relevant information 
in the data, even in small samples. Sufficient statistics are not 
always available. 

Estimation. Combined -. The calculation of a statistic or statistics 
from several sets of unlike data taken together. 

Simultaneous -. The calculation of two or more statistics from 
data simultaneously. 

Factorial experiment. One in which all the treatments or agents under 
investigation are varied simultaneously and combined in such a way 
that any desired main or interaction effect may be isolated and [...] 

Fiducial probability. The probability that a parameter lies within certain 
limits, the Fiducial limits, as exactly determined from the 
information afforded by direct observation, without resort to any 
hypothetical information. 

Grouping. The process of arranging measurements in such a way that 
the observations falling within a given range are replaced, for the 
purposes of calculation, by an equal number of hypothetical measurements 
at the centre of that range (see also Sheppard). 

are calculated are Orthogona^ /uncJ^r^. characterizes a body 

variance of an efficient statistic. 

Invariance. The reciprocal of the vari^co. similar calcu- 

lUration. The method of solving equations by a series ot similar 

lations each leading to the next. 

to the tails and top. 

-JSb'TSr e.”“ 

function. 

Mean. The arithmetic average of a series of observations or quantities.
... Used as an abbreviation for ...

Weighted -. ... each observation or quantity is given a ... weight in
the calculations, ...

... or less approximating to the true mean, used for the purpose ...
mean, the sum of squares and ...

Median. That value of the variate on each side of which lie equal
numbers of observations.

Mode. That value of the variate shown by the most frequent
observational class.
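
The three measures just defined may be compared on a small set of figures. The sketch below uses Python's standard `statistics` module; the data are invented for illustration.

```python
from statistics import mean, median, mode

# Invented observations, chosen so that the three measures differ.
data = [2, 3, 3, 5, 7]

summary = {
    "mean": mean(data),      # arithmetic average
    "median": median(data),  # equal numbers of observations on each side
    "mode": mode(data),      # most frequent value
}
```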

Multinomial series. The series obtained by expanding, to any power,
the sum of three or more quantities.

fr^ S'hi^erptmaorare^^rm^^^^^^^^^ 

Orthogonal. See Independent.

Parameter. A quantity whose value is necessary for the specification of
a hypothetical population.

Partition. The breaking down of a sum of squares or a compound χ²
into simple components, for the purpose of analysis.




Polynomial. A simple power series, of any order, of a variate. 

Orthogonal -s. Polynomials which take out independent sums ...
dependent variate in regression analyses.

Population. The hypothetical infinitely large series of observations or
individuals of which those observations or individuals actually
obtained form a sample.

p7o^Utu “ quantity or statistic is ascertained. 

^heTntLe^TaTtrd^o^dtottn.^ 

ev^nt”^^^ * ^*'°hability relating to the occurrence of single 

thfinnrOTe^t~' relating to the occurrence of more 

Random. Arrived at by chance without the exercise of any choice.

... of arriving at a random arrangement.

... a design involving one restraint.

Regression. ...

- coefficient. The coefficient representing the rate of change
of the dependent ... independent variate.

Cubic -. A ... involving the independent variate to
the third power.

... curve showing the regression in a geometrical representation.

Linear -. A regression involving the independent variate to
the first power.

Multiple -. ... regression, usually linear, on two or more
independent variates ... themselves be correlated. (Also called
Partial Regression.)

Polynomial -. A regression in which the dependent variate is
related ... polynomial function of the independent variates.

Quadratic -. ... involving the independent variate
to the second power.

... regression on one independent variate.

Replication. The incorporation of all treatments or other agents
two or more times ... experimental design, which is then called
a Replicated design.

Restraint. A limitation of the random arrangement of treatments in an
... still capable of unbiased estimation ...
... is reduced or potentially reduced.

Sample. ... observations or individuals taken from the
hypothetical infinitely large population of possible observations or
individuals.

Selection. An exercise of discrimination in sampling or in arrangement.

... calculation ... is said to be Significant.

... designed to assess significance, and so to
distinguish deviations due to sampling error from those indicating
real discrepancies between observation and hypothesis.

Skewness. Asymmetry in a frequency distribution. 

Square. Graeco-Latin -. An experimental design involving three
restraints.

Latin -. An experimental design involving two restraints.

Mean -. Average of the squared deviations of observations
from their mean. The ratio of the sum of squares to the number
of degrees of freedom. Synonymous with estimated variance.

Method of Least -s. A method of estimation, depending on the
minimization of sums of squares, widely used in regression analyses.

Sum of -s. The sum of the squared deviations of observations
from their mean.
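
The relation between the sum of squares and the mean square may be sketched in Python as follows. The observations are invented, and the function names are hypothetical; the degrees of freedom are taken as one fewer than the number of observations, as is usual when the deviations are measured from the sample mean.

```python
def sum_of_squares(xs):
    """Sum of the squared deviations of the observations from their mean."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs)


def mean_square(xs):
    """Sum of squares divided by the degrees of freedom (n - 1)."""
    return sum_of_squares(xs) / (len(xs) - 1)


data = [2.0, 4.0, 4.0, 6.0]   # invented observations, mean 4
ss = sum_of_squares(data)     # 4 + 0 + 0 + 4
ms = mean_square(data)        # estimated variance on 3 degrees of freedom
```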

Statistic. The estimate of a parameter arrived at from observed samples.
It bears the same relation to the sample as the parameter does to
the population.

t. The ratio of an observed deviation to its estimated standard deviation.
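
A minimal sketch of this ratio, for the deviation of a sample mean from a hypothetical value, is given below in Python. The data and the hypothetical mean are invented for illustration; the estimated standard deviation of the mean is taken as the sample standard deviation divided by the square root of the number of observations.

```python
import math
from statistics import mean, stdev


def t_statistic(sample, hypothetical_mean):
    """Observed deviation of the mean divided by its estimated standard deviation."""
    n = len(sample)
    standard_error = stdev(sample) / math.sqrt(n)  # stdev uses n - 1 degrees of freedom
    return (mean(sample) - hypothetical_mean) / standard_error


t = t_statistic([2.0, 4.0, 4.0, 6.0], 3.0)
```

The value obtained would be referred to the Table of t with n - 1 degrees of freedom, here 3.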


Variance. Mean square deviation of a variate from its mean. Estimated
as the Mean square (q.v.). The square of the standard deviation.

Analysis of -. A technique for the isolation of particular
components of variation for assessment by comparison with error
variation.

- ratio. The ratio of two estimated variances. The
natural antilog of 2z.

Variate. A variable quantity whose measurements or frequencies form 

all or part of the data for analysis. 


Weight. The relative value assigned to an estimate when it is being 
combined with other estimates of the same quantity. Usually the 
weight is the invariance or amount of information. 
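
The use of the invariance as a weight may be sketched as follows in Python. The two estimates and their variances are invented, and the function name `combine` is hypothetical; the variance attached to the combined estimate assumes that the estimates are independent.

```python
def combine(estimates, variances):
    """Weighted mean of independent estimates, each weighted by its invariance."""
    weights = [1.0 / v for v in variances]   # invariance = reciprocal of variance
    total = sum(weights)
    weighted_mean = sum(w * e for w, e in zip(weights, estimates)) / total
    return weighted_mean, 1.0 / total        # variance of the combined estimate


combined, combined_variance = combine([10.0, 14.0], [1.0, 4.0])
```

The more precise estimate (variance 1) dominates the result, which lies much nearer to 10 than to 14.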

Yates's correction for continuity. A correction applied in the calculation
of normal deviates or χ²'s to allow for the discrepancy arising from
the fact that the observations are discontinuous while the tables of
c and χ² are calculated on the supposition of continuity of the variate.
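
For a 2 × 2 table the correction is commonly applied by reducing the absolute value of ad - bc by half the total number before squaring. The sketch below, in Python, assumes that familiar form; the cell frequencies are invented and the function name is hypothetical.

```python
def chi_square_2x2(a, b, c, d, yates=True):
    """Chi-square (1 degree of freedom) for the 2 x 2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    diff = abs(a * d - b * c)
    if yates:
        # Continuity correction: reduce the deviation by n/2, never below zero.
        diff = max(0.0, diff - n / 2.0)
    return n * diff ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))


corrected = chi_square_2x2(10, 20, 30, 40)
uncorrected = chi_square_2x2(10, 20, 30, 40, yates=False)
```

The corrected value is always the smaller of the two, so the correction makes the test more conservative.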


z. The natural logarithm of the ratio of two estimated standard deviations. 
(See also Variance ratio.) 
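
Since z is the natural logarithm of a ratio of standard deviations, it is half the natural logarithm of the corresponding ratio of variances, and the variance ratio is recovered as the natural antilog of 2z. A short Python sketch, with invented variances and hypothetical function names, shows the two conversions.

```python
import math


def z_from_variances(v1, v2):
    """z = ln(s1 / s2) = half the natural log of the variance ratio."""
    return 0.5 * math.log(v1 / v2)


def variance_ratio_from_z(z):
    """Variance ratio = natural antilog of 2z."""
    return math.exp(2.0 * z)


z = z_from_variances(12.0, 3.0)    # variance ratio 4, so z = ln 2
ratio = variance_ratio_from_z(z)
```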




TABLES 


TABLE I 


Table of c (the Normal Deviate)

Probability   0.95    0.90    0.80    0.70    0.60    0.50    0.40
c             0.063   0.13    0.25    0.39    0.52    0.67    0.84

Probability   0.30    0.20    0.10    0.05    0.02    0.01    0.001
c             1.04    1.28    1.64    1.96    2.33    2.58    3.29

[Abridged from Statistical Tables for Biological, Agricultural and
Medical Research by R. A. Fisher and F. Yates, with the permission of
the authors and publishers, Messrs. Oliver and Boyd.]
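
Entries of this kind can be checked with the standard `statistics` module of Python. The sketch below assumes, as the entry c = 1.96 at probability 0.05 indicates, that the tabulated probability is that of exceeding c in either direction; the function name is hypothetical.

```python
from statistics import NormalDist


def normal_deviate(p):
    """Deviate c exceeded in absolute value with total probability p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)


c_05 = normal_deviate(0.05)
c_01 = normal_deviate(0.01)
```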


TABLE II 


Table of t 


                                   Probability
  N    0.90   0.80   0.70   0.50   0.30   0.20   0.10    0.05    0.02    0.01    0.001

  1    0.16   0.33   0.51   1.00   1.96   3.08   6.31   12.71   31.82   63.66   636.62
  2    0.14   0.29   0.45   0.82   1.39   1.89   2.92    4.30    6.97    9.93    31.60
  3    0.14   0.28   0.42   0.77   1.25   1.64   2.35    3.18    4.54    5.84    12.94
  4    0.13   0.27   0.41   0.74   1.19   1.53   2.13    2.78    3.75    4.60     8.61
  5    0.13   0.27   0.41   0.73   1.16   1.48   2.02    2.57    3.37    4.03     6.86

  6    0.13   0.27   0.40   0.72   1.13   1.44   1.94    2.45    3.14    3.71     5.96
  7    0.13   0.26   0.40   0.71   1.12   1.42   1.90    2.37    3.00    3.50     5.41
  8    0.13   0.26   0.40   0.71   1.11   1.40   1.86    2.31    2.90    3.36     5.04
  9    0.13   0.26   0.40   0.70   1.10   1.38   1.83    2.26    2.82    3.25     4.78
 10    0.13   0.26   0.40   0.70   1.09   1.37   1.81    2.23    2.76    3.17     4.59

 11    0.13   0.26   0.40   0.70   1.09   1.36   1.80    2.20    2.72    3.11     4.44
 12    0.13   0.26   0.40   0.70   1.08   1.36   1.78    2.18    2.68    3.06     4.32
 13    0.13   0.26   0.39   0.69   1.08   1.35   1.77    2.16    2.65    3.01     4.22
 14    0.13   0.26   0.39   0.69   1.08   1.35   1.76    2.15    2.62    2.98     4.14
 15    0.13   0.26   0.39   0.69   1.07   1.34   1.75    2.13    2.60    2.95     4.07

 16    0.13   0.26   0.39   0.69   1.07   1.34   1.75    2.12    2.58    2.92     4.02
 17    0.13   0.26   0.39   0.69   1.07   1.33   1.74    2.11    2.57    2.90     3.97
 18    0.13   0.26   0.39   0.69   1.07   1.33   1.73    2.10    2.55    2.88     3.92
 19    0.13   0.26   0.39   0.69   1.07   1.33   1.73    2.09    2.54    2.86     3.88
 20    0.13   0.26   0.39   0.69   1.06   1.33   1.73    2.09    2.53    2.85     3.85

 22    0.13   0.26   0.39   0.69   1.06   1.32   1.72    2.07    2.51    2.82     3.79
 24    0.13   0.26   0.39   0.69   1.06   1.32   1.71    2.06    2.49    2.80     3.75
 26    0.13   0.26   0.39   0.68   1.06   1.32   1.71    2.06    2.48    2.78     3.71
 28    0.13   0.26   0.39   0.68   1.06   1.31   1.70    2.05    2.47    2.76     3.67
 30    0.13   0.26   0.39   0.68   1.06   1.31   1.70    2.04    2.46    2.75     3.65

When N is greater than 30, t may be used as a normal deviate without
serious inaccuracy resulting.

[Abridged from Statistical Tables for Biological, Agricultural and
Medical Research by R. A. Fisher and F. Yates, with the permission of
the authors and publishers, Messrs. Oliver and Boyd.]



TABLE III

Table of χ²

                                      Probability
  N    0.90    0.80    0.70    0.50    0.30    0.20    0.10    0.05    0.02    0.01    0.001

  1    0.016   0.064   0.15    0.46    1.07    1.64    2.71    3.84    5.41    6.64   10.83
  2    0.21    0.45    0.71    1.39    2.41    3.22    4.61    5.99    7.82    9.21   13.82
  3    0.58    1.01    1.42    2.37    3.67    4.64    6.25    7.82    9.84   11.34   16.27
  4    1.06    1.65    2.20    3.36    4.88    5.99    7.78    9.49   11.67   13.28   18.47
  5    1.61    2.34    3.00    4.35    6.06    7.29    9.24   11.07   13.39   15.09   20.52

  6    2.20    3.07    3.83    5.35    7.23    8.56   10.65   12.59   15.03   16.81   22.46
  7    2.83    3.82    4.67    6.35    8.38    9.80   12.02   14.07   16.62   18.48   24.32
  8    3.49    4.59    5.53    7.34    9.52   11.03   13.36   15.51   18.17   20.09   26.13
  9    4.17    5.38    6.39    8.34   10.66   12.24   14.68   16.92   19.68   21.67   27.88
 10    4.87    6.18    7.27    9.34   11.78   13.44   15.99   18.31   21.16   23.21   29.59

 11    5.58    6.99    8.15   10.34   12.90   14.63   17.28   19.68   22.62   24.73   31.26
 12    6.30    7.81    9.03   11.34   14.01   15.81   18.55   21.03   24.05   26.22   32.91
 13    7.04    8.63    9.93   12.34   15.12   16.99   19.81   22.36   25.47   27.69   34.53
 14    7.79    9.47   10.82   13.34   16.22   18.15   21.06   23.69   26.87   29.14   36.12
 15    8.55   10.31   11.72   14.34   17.32   19.31   22.31   25.00   28.26   30.58   37.70

 16    9.31   11.15   12.62   15.34   18.42   20.47   23.54   26.30   29.63   32.00   39.25
 17   10.09   12.00   13.53   16.34   19.51   21.62   24.77   27.59   31.00   33.41   40.79
 18   10.87   12.86   14.44   17.34   20.60   22.76   25.99   28.87   32.35   34.81   42.31
 19   11.65   13.72   15.35   18.34   21.69   23.90   27.20   30.14   33.69   36.19   43.82
 20   12.44   14.58   16.27   19.34   22.78   25.04   28.41   31.41   35.02   37.57   45.32

 22   14.04   16.31   18.10   21.34   24.94   27.30   30.81   33.92   37.66   40.29   48.27
 24   15.66   18.06   19.94   23.34   27.10   29.55   33.20   36.42   40.27   42.98   51.18
 26   17.29   19.82   21.79   25.34   29.25   31.80   35.56   38.89   42.86   45.64   54.05
 28   18.94   21.59   23.65   27.34   31.39   34.03   37.92   41.34   45.42   48.28   56.89
 30   20.60   23.36   25.51   29.34   33.53   36.25   40.26   43.77   47.96   50.89   59.70

When N is greater than 30, use √(2χ²) − √(2N − 1) as a normal deviate.

[Abridged from Statistical Tables for Biological, Agricultural and
Medical Research by R. A. Fisher and F. Yates, with the permission of
the authors and publishers, Messrs. Oliver and Boyd.]




TABLE IV 


Table of Variance Ratio 
(i) 0.20 Probability Point



1 

2 

3 

4 

5 

1 

9-5 

12-0 

13-1 

13-7 

14-0 

2 

3-6 

4-0 

4-2 

4-2 

4-3 

3 

2-7 

2-9 

2-9 

3-0 

3-0 

4 

2-4 

2-5 

2-5 

2-5 

2-5 

6 

2-2 

2 3 

2-3 

2-2 

2-2 

6 

2-1 

2-1 

2-1 

2-1 

2-1 

7 

20 

2-0 

2-0 

20 

2-0 

8 

20 

2-0 

2-0 

1-9 

1-9 

9 

1-9 

1-9 

1-9 

1-9 

1-9 

10 

1-9 

1-9 

1-9 

1-8 

1-8 

11 

1-9 

1-9 

1-8 

1-8 

1-8 

12 

1-8 

18 

1-8 

1-8 

1-7 

13 

1-8 

18 

1-8 

1-8 

1-7 

14 

1-8 

1-8 

1-8 

1-7 

1-7 

15 

1 

1-8 

1-8 

1-8 

1-7 

1-7 

16 

1-8 

1-8 

1-7 

1-7 

1-7 

17 

1-8 

1-8 

1-7 

1-7 

1-7 

18 

1-8 

1-8 

1-7 

1-7 

1-6 

19 

1-8 

1-8 

1-7 

1-7 

1-6 

20 

1-8 

1-8 

1-7 

1-7 

1-6 



14-3 

4-3 

30 

2-5 

2-2 

21 

20 

10 

1-8 

1-8 

1-8 

1-7 

17 

1-7 

1-7 

1-6 

1-6 

1-6 

1-6 

1-6 



14-9 

4-4 

30 

26 

2-2 

20 

1-9 

1-8 

1-8 

17 

1-7 

1-7 

1-6 

1-6 

1-6 

1-6 

1-6 

1-6 

1-6 

1-5 


24 


15-2 

4-4 

30 

2-4 

2-2 

20 

1-9 

1-8 

1-7 

1-7 

1-6 

1-6 

1-6 

1-6 

1*6 

1-6 

1*6 

1-6 

1-5 

1-6 


00 


16-6 

4-5 

30 

2-4 

21 

20 

1-8 

1-7 

1-7 

1-6 

1-6 

1-5 

1-5 

1-5 

1-6 

1-4 

1-4 

1-4 

1-4 

1-4 








(ii) 0.05 Probability Point



1 

2 

3 

4 

5 

6 

7 

8 
9 

10 

n 

12 

13 

14 
16 

16 

17 

18 

19 

20 

22 

24 

26 

28 

30 

60 

120 

CO 


6 


12 


24 


161-4 

18-6 

10-1 

7-7 

6-6 


6-0 

6-6 

6-3 

6-1 

5-0 

4-8 
4-8 
4-7 
4-6 
4 5 

4-5 

4-6 

4-4 

4*4 

4-4 

4-3 

4-3 

4-2 

4-2 

4-2 


199-5 
19-0 
9-6 
6-9 
5-8 

5-1 

4-7 

4-5 

4-3 

4-1 

40 

3-9 

3-8 

3-7 

3-7 

3-0 

3-6 

3-6 

3-6 

3-5 

3-4 

3-4 
3-4 
3 3 
3 3 


215-7 

19-2 


224-6 

19-3 


230-2 

19-3 


234-0 

19-3 


243-9 

19-4 


4-0 
3-9 
3 8 


3-2 

3-1 

3-0 


9-3 

9-1 

9-0 

8-9 

8-7 

6-6 

6-4 

6-3 

6-2 

6-9 

5-4 

5-2 

6-1 

6-0 

4-7 

4-8 

4-5 

4-4 

4 3 

40 

4-4 

4-1 

4-0 

3-9 

3-6 

4-1 

3-8 

3-7 

3-6 

3-3 

3-9 

3-6 

3-5 

3-4 

3 1 

3-7 

3-5 

3 3 

3-2 

2-9 

3-6 

3-4 

3-2 

3-1 

2-8 

3-5 

3-3 

3-1 

3-0 

2-7 

3-4 

3-2 

3-0 

2-9 

2-6 

3 3 

3-1 

3-0 

2-9 

2-5 

3 3 

3-1 

2-9 

2-8 

2-6 

3 2 

3 0 

2-9 

2-7 

2-4 

3 2 

3 0 

2-8 

2-7 

2-4 

3-2 

2-9 

2-8 

2-7 

2-3 

3-1 

2 9 

2-7 

2-6 

2-3 

3-1 

2-9 

2-7 

2-6 

2 3 

3-1 

2-8 

2-7 

2-6 

2-2 

3-0 

2-8 

2-6 

2-5 

2-2 

3-0 

2-7 

2-6 

2 5 

2-2 

3-0 

2-7 

2-6 

2-4 

2 1 

2-9 

2-7 

2 5 

2-4 

2 1 

2-8 

2-5 

2-4 

2-3 

1-9 

2-7 

2-6 

2-3 

2-2 

1-8 

2-6 

2-4 

2-2 

2-1 

1-8 


249-0 

19-5 

8-6 

5-8 

4-5 

3-8 
3-4 
3 1 
2-9 
2-7 

2-6 

2-5 

2-4 

2-3 

2-3 

2-2 

2-2 

2-1 

2-1 

2-1 

2-0 

2-0 

2-0 

1-9 

1-9 

1-7 

1-6 

15 


00 


254-3 

19-5 

8-5 

5-6 

4-4 

3-7 

3-2 

2-9 

2-7 

2-6 

2-4 
2 3 
2-2 
2-1 
2-1 

2-0 

2-0 

1-9 

1-9 

1-8 

1-8 

1-7 

1-7 

1-7 

1-6 

1-4 

1-3 

1-0 





(iii) 0.01 Probability Point


 N2 \ N1      1       2       3       4       5       6      12      24       ∞

    1      4,052   4,999   5,403   5,625   5,764   5,859   6,106   6,234   6,366
    2       98.5    99.0    99.2    99.3    99.3    99.3    99.4    99.5    99.5
    3       34.1    30.8    29.5    28.7    28.2    27.9    27.1    26.6    26.1
    4       21.2    18.0    16.7    16.0    15.5    15.2    14.4    13.9    13.5
    5       16.3    13.3    12.1    11.4    11.0    10.7     9.9     9.5     9.0

    6       13.7    10.9     9.8     9.2     8.8     8.5     7.7     7.3     6.9
    7       12.3     9.6     8.5     7.9     7.5     7.2     6.5     6.1     5.7
    8       11.3     8.7     7.6     7.0     6.6     6.4     5.7     5.3     4.9
    9       10.6     8.0     7.0     6.4     6.1     5.8     5.1     4.7     4.3
   10       10.0     7.6     6.6     6.0     5.6     5.4     4.7     4.3     3.9

   11        9.7     7.2     6.2     5.7     5.3     5.1     4.4     4.0     3.6
   12        9.3     6.9     6.0     5.4     5.1     4.8     4.2     3.8     3.4
   13        9.1     6.7     5.7     5.2     4.9     4.6     4.0     3.6     3.2
   14        8.9     6.5     5.6     5.0     4.7     4.5     3.8     3.4     3.0
   15        8.7     6.4     5.4     4.9     4.6     4.3     3.7     3.3     2.9

   16        8.5     6.2     5.3     4.8     4.4     4.2     3.6     3.2     2.8
   17        8.4     6.1     5.2     4.7     4.3     4.1     3.5     3.1     2.7
   18        8.3     6.0     5.1     4.6     4.3     4.0     3.4     3.0     2.6
   19        8.2     5.9     5.0     4.5     4.2     3.9     3.3     2.9     2.5
   20        8.1     5.9     4.9     4.4     4.1     3.9     3.2     2.9     2.4

   22        7.9     5.7     4.8     4.3     4.0     3.8     3.1     2.8     2.3
   24        7.8     5.6     4.7     4.2     3.9     3.7     3.0     2.7     2.2
   26        7.7     5.5     4.6     4.1     3.8     3.6     3.0     2.6     2.1
   28        7.6     5.5     4.6     4.1     3.8     3.5     2.9     2.5     2.1
   30        7.6     5.4     4.5     4.0     3.7     3.5     2.8     2.5     2.0

   60        7.1     5.0     4.1     3.7     3.3     3.1     2.5     2.1     1.6
  120        6.9     4.8     4.0     3.5     3.2     3.0     2.3     2.0     1.4
    ∞        6.6     4.6     3.8     3.3     3.0     2.8     2.2     1.8     1.0





(iv) 0.001 Probability Point





INDEX


analysis of x*» 

analysis of covariance, 121, 128 
analysis of variance, 69. 86, 115, 
119, 134, 170, 171, 234 

incomplete-, 79, 129, 170 

Antirrhinum, 191 
apple, 89 
Ashby, 129 
asparagus, 104 
assay, 227 

backcross, 15, 19, 21, 185, 223, 227 

barley, 72 

Bath, 161 

Bayes, 204 

bias, 30 

binomial, 17, 25, 26, 35, 38 et seq.,
52, 210, 234
Blackman, 36

Bliss, 242, 244, 245, 247, 252
Bortkewitch, 36 
Brandt, 198 

Brassica oleracea, 201, 207 
Buck, 149 


c, 42, 43, 46 et seq., 50, 51, 85, 192
Catcheside, 201, 207 
children, 161 
combination, 166 
combinations, 17 

comparisons, 63, 65, 71, 178, 180, 
181, 191 

compound interest, 109 
concomitant observations, 123, 128 
confounding, 104, 108 
partial-, 107 

contingency table, 193 et seq., 196
et seq., 200 et seq.


continuity, 51, 174 

Yates' correction for -, 175, 176, 195

correction term, 55, 58, 68
correlation, 109, 160 et seq.
- coefficient, 160, 162, 169, 173

inter-class —, 161, 1^3 

intra-class-, 168, 173 

partial-, 167 

transformed -, 164, 165, 



168, 


168, 


169 

covariance, 35, 54, 147, 148 


Crane, 55 
cubic curve, 133 

χ², 46 et seq., 174 et seq., 192, 220,
226, 237, 249

Datura stramonium, 50, 175, 189 
decomposition, 107 
deductive reasoning, 9 
degrees of freedom, 33, 40, 43, 45,
46, 47, 57, 60, 61 et seq., 69, 72,
114, 133, 148, 155, 164, 170,
178, 189, 231

loss of -, 33, 115, 188, 200,
220, 226

deviation, 42, 44, 231 

-191, 192, 196, 225 

diagrams, 12 

difference, variance of, 54 
discriminant function, 152 et seq.
Dobzhansky, 79, 156
Drosophila melanogaster, 21
Drosophila pseudo-obscura, 79, 156
drugs, 227 

East, 28 
Eden, 35 

efficiency, 216, 230 

-index, 109 

Emerson, 28 

error, 66, 68, 69, 70, 78, 85, 87, 93,
94, 110 (see also sampling error)

control of-, 89, 93 et seq. 

Eryngium maritimum, 36 
estimate, 11, 28 
estimation, 203 et seq.

combined,-, 222 

simultaneous,-, 221 

Fj, 16, 223, 228 

factorial experiment, 87, 91, 93, 102 

-notation, 16 

fireflies, 149 

Fisher, 12, 28, 36, 38, 43, 44, 71, 86,
97, 102, 108, 138, 148, 152, 169,
164, 183, 185, 187, 194, 195,
204, 210, 216, 218, 223, 227,
231, 234, 242, 243, 244, 245,
247

frequencies, 48, 174 
frequency distribution, 29, 161 
-surface, 160 




Gaddum, 245, 252 
gene, 14, 19, 21, 38, 50, 175, 181,
191, 213, 222, 227 
genetical experiments, 50 
goodness of fit, 46 
Graeco-Latin square, 101 
graph, 12 
Griffiths, 161 
grouping, 30 

Sheppard’s correction for-, 

32 

growth, 109 

haemacytometer counts, 36 
Haldane, 213 
Harrison, 116 
Hartman, 196 
Hayes, 72 

heterogeneity, test of, 189, 192, 225 
et seq. 

histogram, 34 

hypothesis, 10, 165, 188, 204 
null-, 65 

Imai, 183 
Immer, 72 

incomplete classification, 231 
independent comparisons, 64, 91, 
103 

-observations, 53 

-regressions, 142, 147 

inductive reasoning, 9 
information, amount of, 166, 205, 
210, 212, 218, 224, 227, 232, 249 
integral coefficients, 136, 143 
interaction, 70 et seq., 79, 183 
invariance, 166 
Iris, 159 
Irwin, 252 
iteration, 208, 212 

kurtosis, 35 

Larxun, 171 

Latin square, 95 et seq. 

Lawrence, 193 
LD50, 250 et seq. 
least squares, 113, 146, 243 
likelihood, 205, 223, 243 

log-, 206, 223 

line, straight, 109, 132 
linearity, 129 

linkage, 181 et seq., 213, 215, 217 
218, 220 


main effect, 70, 183 
maize, 25, 28 

Mather, 38, 55, 79, 156, 165, 171,
191, 193, 198, 218, 222, 223
maximum likelihood, 113, 205 et
seq., 221, 222, 243
mean, 26, 28, 30 et seq., 36, 38, 40, 
54, 55, 166 

weighted-, 167 

mean deviation, 27 
mean square, 32, 68, 69, 192 
measurements, 48, 174 
median, 26 
mice, 38, 223 

misclassification, 153, 158, 159 
mode, 26 
moments, 35 

multinomial, 18, 26, 195, 206 
Newell, 193 

normal curve, 25 et seq., 221, 240 

-deviate, 42, 43, 50, 51, 174, 

231, 240 

observations, 10, 204 
orthogonal functions, 64, 65, 72

-polynomials, 133 et seq. 

-treatments, 95 

ouabain, 245 

Papaver, 183, 187 
parameter, 10, 28, 33, 115, 148, 
188, 204 

partition, 61 et seq., 139, 180 et seq., 
193, 239 
Pearson, 12, 46 
permutations, 16 
Pharbitis, 183 
phenylthiocarbamide, 196 
Philp, 183, 187, 190, 194 
plating, 36 

Poisson series, 35 et seq., 206, 239 
population, 10, 11, 204 

super-, 11, 204 

potatoes, 97, 116 
Powers, 72 

precision, 107, 166, 205, 216, 230 
Primula sinensis, 14, 18, 25, 193, 
213 

probability, 14, 19 et seq., 29, 41, 
205 

compound-, 15 et seq. 

fiducial-, 232 

inverse-, 204 

probit, 242 et seq. 





quadrat, 30
quadratic curve, 132

radish, 235 
Rana pipiens, 245 
random numbers, 23 

- sampling, 23, 28, 94

randomized block, 95, 102 
recombination, 203, 227 
regression, 109 et seq., 160, 242 

-coefficient, 113, 119 

multiple -, 146 et seq., 152, 167

partial -, 167

polynomial -, 129, 133 et seq., 152

weighted -, 245 et seq.

restraints, 95, 99, 101 
rigour, 11, 49, 232 
Roberts, 161 
rye, 171 

sample, 10 

sampling error, 10, 11, 28, 41, 46, 67, 
76, 87, 113 

segregation, 181 et seq., 222
selection, 23, 110 
self-pollination, 181, 227 
Shippy, 89 

significance, 21 et seq., 51

test of -, 10, 19, 21, 23, 24, 41
et seq., 69, 87, 165, 203, 221,
231, 232

Sirks, 50, 51, 175, 189 
skewness, 35 
Smith, 245 
Snedecor, 44, 198 

standard deviation, 27, 28, 30 et seq.,
40, 43, 50, 52, 166
-error, 50, 78 




statistic, 10, 35 

consistent-, 213 

efficient-, 213, 231 

inefficient -, 213 et seq.

sufficient-, 213, 231 

Steward, 116 

Student, 12, 43 

sum, variance of, 54

sum of cross products, 54, 64 

-squares, 31, 36, 39, 46, 61 

et seq., 66, 69 

t, 43, 44, 46 et seq., 69, 78, 85, 174,
232

taxonomic data, 152, 159 
Tedin, 44 

tomato, 55, 65, 79, 129 
toxicology, 239 et seq. 
transformation, 109, 234 et seq. 
trout, 123 

uncertainty, 11, 232 

variance, 32, 35, 36, 38, 40, 50, 52
et seq., 166, 231

■-formula, 216 

-ratio, 44 et seq., 78, 85, 192 

variate, dependent, 110, 160 
-, independent, 110, 160 

Washboum, 123 

weight, 166, 225, 243, 248, 249 

Winton, de, 213 

Wishart, 104 

working mean, 31 et seq. 

Yates, 35, 43, 44, 97, 108, 138, 175,
234, 242, 244, 245, 247

z, 44 et seq., 78, 85, 164, 167, 168, 
170, 174, 192 






